# <b>CaRM Module: Advanced Topics in Data Preparation Using Python (2024/2025)</b>
## <b>Session 1: Introduction to Pandas and DataFrames.</b>

### <b>Introduction to other data structures such as dictionaries, series and dataframes.</b>



#### <b>1.1. Dictionaries</b>

A Python <b>dictionary</b> is another data structure that stores a collection of objects. Its distinguishing characteristic is that it stores them as key-value pairs. To create a dictionary, enclose a comma-separated list of key:value pairs in curly braces, with a colon (:) between each key and its associated value. For example:

In [None]:
myDict = {'fruits': ['apple', 'pear', 'orange'], 'animals': ['dog', 'cat', 'cow'], 'tools': ['hammer', 'drill', 'saw']}
print(myDict)

A dictionary can contain values of any data type. Keys can be integers as well as strings, but each key must be unique:

In [None]:
myDict = {1: ['apple', 'pear', 'orange'], 'numbers': [1,2,3,4], 'tools': ['hammer', 'drill', 'saw']}
print(myDict)

The following code creates a dictionary with only two key-value pairs because the third pair reuses the key 1. A value assigned to a duplicate key overwrites the existing value. Try it:

In [None]:
myDict = {1: ['apple', 'pear', 'orange'], 'numbers': [1,2,3,4], 1: ['hammer', 'drill', 'saw']}
print(myDict)

There are different ways to access the values in a dictionary. The simplest way is to use its key name, inside square brackets:

In [None]:
print(myDict[1])
print(myDict['numbers'])
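
Square brackets are not the only option. The built-in dictionary methods `.get()`, `.keys()`, `.values()` and `.items()` offer other ways to reach the contents. The sketch below uses a small dictionary made up for illustration:

```python
# illustrative dictionary, similar in shape to the examples above
myDict = {'fruits': ['apple', 'pear', 'orange'], 'numbers': [1, 2, 3, 4]}

# .get() returns the value, or a default if the key is missing (no KeyError)
print(myDict.get('fruits'))
print(myDict.get('tools', 'key not found'))

# .keys() and .values() give access to the keys and to the values
print(list(myDict.keys()))
print(list(myDict.values()))

# .items() yields (key, value) pairs, which is handy in a for loop
for key, value in myDict.items():
    print(key, '->', value)
```

Using `.get()` is a reasonable choice when you are not sure a key exists, because `myDict['tools']` would raise a `KeyError` here.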

Dictionaries are very useful for storing associated information, for example, to connect participant IDs with their survey files. We will take advantage of the associative feature of dictionaries later in this Module.
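
As a rough sketch of that idea, the dictionary below links participant IDs to survey file names. The IDs and file names are hypothetical, invented only for this illustration:

```python
# hypothetical participant IDs and file names, for illustration only
surveyFiles = {
    'P001': 'survey_P001.csv',
    'P002': 'survey_P002.csv',
    'P003': 'survey_P003.csv',
}

# look up the file that belongs to one participant
print(surveyFiles['P002'])

# add a new participant, then loop over all ID-file pairs
surveyFiles['P004'] = 'survey_P004.csv'
for participant, filename in surveyFiles.items():
    print(f'{participant} -> {filename}')
```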

#### <b>1.2. Series</b>

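A Pandas <b>Series</b> is a one-dimensional labelled array: a single column of values, each paired with an index label.

<b>Create a Series from a list:</b>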

In [None]:
import pandas as pd # remember, pd is the alias for pandas

myList = ['a','b','c']
mySeries = pd.Series(myList, name='letters') # the name describes the content of the Series

print(mySeries)
print(mySeries.name)


<b>Create a Series from a dictionary:</b>

In [None]:
import pandas as pd # you can omit this line if you have already imported pandas

myDict = {1:'a', 2:'b', 3:'c'}
mySeries = pd.Series(myDict)

print(mySeries)


<b>Accessing the elements of a Series</b>

In [None]:
import pandas as pd

mySeries = pd.Series(
    [1,2,3,4], 
    name='numbers'
    )

# retrieve first element 
print('The first element is:')
print(mySeries[0])

# get the first two elements
print('\nThe first two elements are:')
print(mySeries[:2])

# get the last element
print('\nThe last element is:')
print(mySeries[-1:])

The index labels of a Series can also be strings. Elements can then be accessed using either the index label or the position.

In [None]:
import pandas as pd

mySeries = pd.Series(
    [1,2,3,4], 
    name='numbers',
    index=['a','b','c','d'] # you can add labels to the indexes
    )

print(mySeries)
print(mySeries['a':'c']) # slice by label: the end label 'c' is included
print(mySeries[0:2]) # slice by position: the end position 2 is excluded
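
Because plain square brackets accept both labels and positions, it can be unclear which one is meant. The accessors `.loc` (labels) and `.iloc` (positions) make the intention explicit. A minimal sketch with the same Series:

```python
import pandas as pd

mySeries = pd.Series([1, 2, 3, 4], name='numbers', index=['a', 'b', 'c', 'd'])

# .loc selects by label; the end label is included
byLabel = mySeries.loc['a':'c']
print(byLabel)

# .iloc selects by position; the end position is excluded
byPosition = mySeries.iloc[0:2]
print(byPosition)

# single elements
print(mySeries.loc['d'])   # value with the label 'd'
print(mySeries.iloc[-1])   # last value, by position
```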

#### <b>1.3. DataFrames</b>

A Pandas <b>DataFrame</b> is a two-dimensional data structure, like a table with rows and columns. It is possible to create a simple DataFrame from a dictionary.

In [None]:
import pandas as pd # import pandas library using the alias pd

# create a Python dictionary
myDict = {'fruits': ['apple', 'pear', 'orange'], 'animals': ['dog', 'cat', 'cow'], 'tools': ['hammer', 'drill', 'saw']}
print('The dictionary below...')
print(myDict)

# create a DataFrame object from the dictionary:
df = pd.DataFrame(myDict)
print('\n...is converted into a sort of table (the dataframe):')
print(df) 

You could also create a DataFrame from Series:

In [None]:
import pandas as pd

mySeries1 = pd.Series(
    ['apple', 'orange', 'pear', 'melon'], 
    name= 'fruits',
    )

mySeries2 = pd.Series(
    ['dog', 'cat', 'cow', 'bird'], 
    name= 'animals',
    )

list_of_series = [mySeries1, mySeries2]
myDf = pd.concat(list_of_series, axis=1)

print(myDf)

It is also possible to extract columns of a DataFrame as Series:

In [None]:
import pandas as pd

myDict = {'fruits': ['apple', 'orange', 'pear', 'melon'], 'animals': ['dog', 'cat', 'cow','bird']}
myDf = pd.DataFrame(myDict)

mySeries1 = myDf['fruits']
print(mySeries1)

mySeries2 = myDf['animals']
print(mySeries2)
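
The type of the result depends on how the column is requested. As a short sketch: single brackets return a Series, whereas double brackets (a list of column names) return a DataFrame, even for one column.

```python
import pandas as pd

myDict = {'fruits': ['apple', 'orange', 'pear', 'melon'], 'animals': ['dog', 'cat', 'cow', 'bird']}
myDf = pd.DataFrame(myDict)

oneColumn = myDf['fruits']       # single brackets: a Series
smallDf = myDf[['fruits']]       # double brackets: a DataFrame with one column
print(type(oneColumn))
print(type(smallDf))

# .to_frame() converts a Series back into a one-column DataFrame
backToDf = oneColumn.to_frame()
print(backToDf.shape)
```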

### <b>Mission 1.</b> 
#### <b>1.4. Reading csv files with Pandas. DataFrame attributes for inspecting the data.</b>


In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')
print('\nThis is the shape of the dataframe:')
print(df.shape) # first number is the number of rows, second number is the number of columns
print('\nThese are the column names of the dataframe:')
print(df.columns) #.to_list()
print('\nThese are the indexes of the dataframe:')
print(df.index) #.to_list()
print('\nThese are the data types of the columns of the dataframe:')
print(df.dtypes)
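
The commented-out `.to_list()` calls above convert the column names and the index into plain Python lists, which are often easier to read and to loop over. The sketch below uses a small made-up DataFrame in place of the course data file, so its column names and values are assumptions:

```python
import pandas as pd

# small made-up DataFrame, standing in for the course data file
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'], 'height': [165.0, 180.5, 172.0]})

# .shape is a tuple, so it can be unpacked into two variables
nRows, nCols = df.shape
print(f'{nRows} rows and {nCols} columns')

# convert the column names and the index into Python lists
columnNames = df.columns.to_list()
rowLabels = df.index.to_list()
print(columnNames)
print(rowLabels)
```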

#### <b>1.5. DataFrame methods for inspecting the data.</b>


In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

print('\nThe method .head(n) returns the first few (n) rows:')
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 100) # changes the width of the window where the data is displayed
print(df.head())
#print(df)


In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

print('\nThe method .tail(n) returns the last few (n) rows:')
print(df.tail()) 

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

print('\nThe method .sample(n) returns a few (n) rows, selected at random:')
print(df.sample(5)) 

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# prints the index range (number of rows), the position and label of each column,
# the number of non-null values in each column, the column data types, and memory usage:
print('\nThe method .info() prints information on each of the columns:')
df.info() # .info() prints directly and returns None, so it does not need print()

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

print('\nThe method .describe() calculates a few summary statistics for each numeric column:')
print(df.describe())
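
By default, `.describe()` on a DataFrame with mixed data types summarises only the numeric columns. Passing `include='all'` adds the text columns, reporting for them the count, the number of unique values, the most frequent value (`top`) and its frequency (`freq`). A minimal sketch on a made-up DataFrame (the column names and values are assumptions, not the course data):

```python
import pandas as pd

# small made-up DataFrame, standing in for the course data file
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'sex': ['female', 'male', 'male', 'female'],
    'height': [165.0, 180.5, 172.0, 158.0],
})

# default behaviour: numeric columns only
numericSummary = df.describe()
print(numericSummary)

# include='all': numeric and text columns together
fullSummary = df.describe(include='all')
print(fullSummary)
```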

#### <b>1.6. Other useful methods for columns or variables with categorical and string values.</b>

Let's try to find out more about the data types of each variable (column):

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# prints the index range (number of rows), the position and label of each column,
# the number of non-null values in each column, the column data types, and memory usage:
df.info() 


Note that there are only 4 numeric columns in the dataframe. The rest are shown as 'object', but they are probably strings. At this point, it would be convenient to identify which of those text columns could hold categorical values. One strategy is to explore each text column and investigate how many unique terms it has.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

print(df.columns) # print the column names
colname = 'sex'
print(f'\nUnique values of column {colname}:')
print(df[colname].unique()) # return unique values of Series objects.
print(f'\nCount for each value in column {colname}:')
print(df[colname].value_counts()) # print a Series containing counts of unique values.
print(f'\nDoes column {colname} have any null values?')
print(df[colname].isnull().any()) # returns True if the column has missing values, or False if not
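
Rather than checking one column at a time, the same strategy can be applied to all text columns at once. The sketch below counts the distinct values in every non-numeric column; columns with few distinct values are candidates for categorical variables. It uses a made-up DataFrame, so the column names and values are assumptions:

```python
import pandas as pd

# small made-up DataFrame, standing in for the course data file
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'sex': ['female', 'male', 'male', None],
    'height': [165.0, 180.5, 172.0, 158.0],
})

# keep only the non-numeric columns
textColumns = df.select_dtypes(exclude='number')

# number of distinct (non-missing) values in each text column
uniqueCounts = textColumns.nunique()
print(uniqueCounts)

# number of missing values in each column of the full DataFrame
missingCounts = df.isnull().sum()
print(missingCounts)
```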

Sometimes it is not possible to find a small set of values in a column with strings. However, there are specific terms that repeat. It may be useful to identify the values that contain a specific keyword, so they can be grouped together into a specific category. 

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

print(df.columns) # print the column names
colname = 'occupation'
keyword = 'Jedi'

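# vectorised alternative (na=False treats missing values as non-matches):
#print(df[df[colname].str.contains(keyword, case=False, na=False)][colname])

for x in df[colname].dropna(): # skip missing values, which are floats (NaN), not strings
    if keyword in x:
        print(x)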
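
The keyword search can also be written without a loop, using the string method `.str.contains()`. This is a minimal sketch on a made-up Series (the values are invented for illustration); `case=False` ignores upper and lower case, and `na=False` treats missing values as non-matches:

```python
import pandas as pd

# made-up values, including one missing value
occupations = pd.Series(['Jedi Master', 'Smuggler', None, 'jedi knight'], name='occupation')

# boolean mask: True where the keyword appears, ignoring case
mask = occupations.str.contains('Jedi', case=False, na=False)

# use the mask to keep only the matching rows
matches = occupations[mask]
print(matches)
```

Note that, unlike the loop with `keyword in x`, this version is not case-sensitive, so it also finds 'jedi knight'.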

You will also notice that some columns hold a sequence of terms separated by commas. Sometimes those terms repeat across rows. It may be useful to identify all the unique terms that appear in those columns. Perhaps there is a term that can be used to create a categorical variable in a future step of the preprocessing pipeline.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

#print(df.columns) # print the column names
colname = 'equipment'#'abilities' #'films' #'vehicles'

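fulllist = []
for x in df[colname].dropna(): # skip missing values so that 'nan' is not listed as a term
    fulllist.extend(y.strip() for y in str(x).split(','))
    
uniquevalues = sorted(set(fulllist)) # sorted, so the terms print in a stable order
for value in uniquevalues:
    print(value)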
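
A possible alternative to the loop is to chain the Pandas methods `.str.split()`, `.explode()` and `.value_counts()`, which also reports how often each term appears. The sketch below runs on a made-up Series; the values are invented for illustration:

```python
import pandas as pd

# made-up comma-separated values, including one missing value
equipment = pd.Series(
    ['lightsaber, blaster', 'blaster', None, 'lightsaber,  comlink'],
    name='equipment'
    )

# split each cell into a list, give every term its own row, and trim the spaces
terms = equipment.dropna().str.split(',').explode().str.strip()

# unique terms in alphabetical order
uniqueTerms = sorted(terms.unique())
print(uniqueTerms)

# how many times each term appears
termCounts = terms.value_counts()
print(termCounts)
```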

Plots for numeric data:

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')
colname = 'height'
df.plot.hist(column=[colname])

colname = 'death_year'
df.plot.hist(column=[colname],by='birth_era')

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

colname = 'sex'
#print(df.groupby([colname]).count())
df.groupby([colname]).count().plot(kind='pie', y='name').legend(bbox_to_anchor=(1.5, 1))

colname = 'species'
df.groupby([colname]).count().plot(kind='pie', y='name').legend(bbox_to_anchor=(1.5, 1))

df.groupby([colname]).mean(numeric_only=True).plot(kind='pie', y='mass').legend(bbox_to_anchor=(1.5, 1))
#df.groupby([colname]).value_counts().plot(kind='pie')