# Pandas

Pandas is a Python package for data import, analysis and manipulation. 

Before we can use pandas, we must first import it into our program. We also import NumPy, which we use later on.

In [None]:
import pandas as pd
import numpy as np

## Series and DataFrame

Pandas has two main data structures: Series and DataFrame.

We create a Series by passing a sequence (here a list) of values to the `Series` function.

In [None]:
name_list = ['Oleg', 'Jenny', 'Chang', 'Jonas']

series = pd.Series(name_list)

print(series)

Which data type is `series`?

In [None]:
type(series)

A Series has an `index` attribute. Note the lack of parentheses `()`, which distinguishes an attribute from a method or function call.

In [None]:
print(series.index)
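The index does not have to be the default 0, 1, 2, ... numbering. As a minimal sketch (the variable name `scores` and the choice of labels are only for illustration), we can supply our own labels through the `index` parameter and then look up values by label:

```python
import pandas as pd

# Illustrative data: scores labelled by student name instead of 0, 1, 2
scores = pd.Series([65.0, 58.0, 79.0], index=['Oleg', 'Jenny', 'Chang'])

print(scores.index)     # the labels we supplied
print(scores['Jenny'])  # look up a value by its label
```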

However, we usually work with two-dimensional data, i.e. several variables for each observation. We can store two-dimensional data in a pandas **DataFrame**.

Define a dictionary with the keys as the column names and the values as the data.

In [None]:
grade_dict = {'Name' : ['Oleg', 'Jenny', 'Chang', 'Jonas'],
              'Score' : [65.0, 58.0, 79.0, 95.0],
              'Pass' : ['yes', 'no', 'yes', 'yes']}

The DataFrame is created by passing the dictionary with column names and values to the `DataFrame` function.

In [None]:
df = pd.DataFrame(grade_dict)

print(df)

In Jupyter, the formatting of a dataframe is better if we omit the print statement:

In [None]:
df
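A quick way to check that a DataFrame was built as intended is to look at its `shape` and `dtypes` attributes. The sketch below rebuilds the same kind of dictionary under the illustrative name `df_demo`, so that it runs on its own:

```python
import pandas as pd

# Each key becomes a column name; each list becomes that column's values.
# All lists must have the same length.
demo_dict = {'Name': ['Oleg', 'Jenny', 'Chang', 'Jonas'],
             'Score': [65.0, 58.0, 79.0, 95.0],
             'Pass': ['yes', 'no', 'yes', 'yes']}

df_demo = pd.DataFrame(demo_dict)

print(df_demo.shape)   # (number of rows, number of columns)
print(df_demo.dtypes)  # one data type per column
```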

A DataFrame has both an `index` and a `columns` attribute.

In [None]:
print(df.index)

In [None]:
print(df.columns)

To select a column, we place the column name inside square brackets.

In [None]:
print(df['Name'])

To select multiple columns, we place a list of column names inside square brackets (hence the double brackets: [[ ]]).

In [None]:
print(df[['Name', 'Score']])

To select rows, pass the index label(s) to the `loc` indexer. Note that `loc` uses square brackets, not parentheses.

In [None]:
print(df.loc[3])

In [None]:
print(df.loc[1:2])

Notice that when selecting multiple rows with a slice, it is actually not necessary to use `loc`. Without `loc`, however, the slice is based on position and the end-point is not included.

In [None]:
print(df[1:2])

To slice on both rows and columns, the `.loc` indexer requires a selection of row labels and a list of column *names*:

In [None]:
print(df.loc[1:2,['Name','Score']])

The `.iloc` indexer selects based on integer **position** and not on labels or column names, and it does not include the end-point:

In [None]:
print(df.iloc[1:2,0:2])
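The difference between labels and positions is easier to see when the index is not the default numbering. The following is a small sketch using a made-up DataFrame `df_demo` with letters as index labels; both selections below return the same two rows:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Oleg', 'Jenny', 'Chang', 'Jonas'],
                        'Score': [65.0, 58.0, 79.0, 95.0]},
                       index=['a', 'b', 'c', 'd'])

# loc: label-based, the end label 'c' IS included
by_label = df_demo.loc['b':'c', ['Name', 'Score']]

# iloc: position-based, the end position 3 is NOT included
by_position = df_demo.iloc[1:3, 0:2]

print(by_label)
print(by_position)
```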

## Import and save files

The file `titanic.csv` contains information on the passengers of the Titanic.

The file consists of the following data columns:

* PassengerId: Id of every passenger.
* Survived: Takes the value 0 or 1. 0 means the passenger did not survive and 1 means the passenger survived.
* Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
* Name: Name of passenger.
* Sex: Gender of passenger.
* Age: Age of passenger.

**Alternative 1**: read the file using Python's built-in `open` function, which we have gone through earlier.

In [None]:
# The with statement closes the file automatically when the block ends
with open('titanic.csv', 'r') as titanic:
    for line in titanic:
        print(line)

**Alternative 2** (preferred): read the file using the `read_csv` function from `pandas`.

In [None]:
titanic = pd.read_csv('titanic.csv')

In [None]:
type(titanic)

In [None]:
titanic

Notice that `read_csv` assumes a comma separator by default, but this can be changed with the `sep` parameter. Open the file `titanic_pipe.csv` in a text editor to see the structure of the file. A pipe-delimited version of the file can be read with:

In [None]:
titanic_pipe = pd.read_csv('titanic_pipe.csv', sep = '|')

In [None]:
titanic_pipe
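To see what happens when the separator does not match the file, here is a small sketch that reads pipe-delimited text from memory with `io.StringIO` instead of from a file. The two data rows are invented for illustration and are not taken from the Titanic data set:

```python
import io
import pandas as pd

# Invented pipe-delimited text standing in for a file on disk
pipe_text = "PassengerId|Name|Survived\n1|Passenger A|0\n2|Passenger B|1\n"

# Default separator (comma): every line ends up in a single column
wrong = pd.read_csv(io.StringIO(pipe_text))

# Correct separator: three separate columns
right = pd.read_csv(io.StringIO(pipe_text), sep='|')

print(wrong.shape)
print(right.shape)
```

Comparing the two shapes is a quick check that the separator was chosen correctly.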

We can write the data set to a spreadsheet format using the `.to_excel` method.

In [None]:
titanic.to_excel('titanic.xlsx')

We can pass arguments to the optional parameters `sheet_name` and `index` (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html for an overview of all the parameters).

In [None]:
titanic.to_excel('titanic.xlsx', sheet_name = 'passengers', index = False)

We can then read the Excel file using the `read_excel` function.

In [None]:
titanic = pd.read_excel('titanic.xlsx')

In [None]:
titanic

We can pass a list of column names to the parameter `usecols` if we want to read only a few columns.

In [None]:
titanic2 = pd.read_excel('titanic.xlsx', usecols = ['PassengerId', 'Name', 'Survived'])

In [None]:
titanic2

## Data manipulation

Once data is imported (or created) we usually need to convert the raw data to a format that is suitable for analysis. 

We will cover the following topics: 
- Data exploration 
- Data transformation
- Missing data
- Filtering rows
- Groupby
- Combining data

Please see chapter 3 in the Python Data Science Handbook for a more in-depth treatment of each of these topics (https://jakevdp.github.io/PythonDataScienceHandbook/).

### Data exploration

After a file has been imported, it is important to explore the data to make sure that the file was imported correctly.

The functions `head` and `tail` show the first five and the last five rows of the DataFrame.

In [None]:
titanic.head()

In [None]:
titanic.tail()

An argument can be supplied to change the number of rows returned:

In [None]:
titanic.head(9)

In [None]:
titanic.tail(9)

The function `info` displays the data types of the columns (notice that `object` usually indicates a column of strings).

In [None]:
titanic.info()

The function `describe` provides descriptive statistics for the **numeric** columns.

In [None]:
titanic.describe()

The functions `nunique` and `unique` show the number of unique values and the unique values themselves in a given column.

In [None]:
titanic['Survived'].nunique()

In [None]:
titanic['Survived'].unique()

Alternatively, we can use the function `value_counts`, which shows how many times each unique value occurs.

In [None]:
titanic['Survived'].value_counts()
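If shares are more useful than raw counts, `value_counts` accepts the parameter `normalize`. A minimal sketch on a small made-up Series (the values are illustrative, not the real Titanic data):

```python
import pandas as pd

# Made-up 0/1 values standing in for a column such as 'Survived'
survived = pd.Series([0, 1, 1, 0, 0])

counts = survived.value_counts()                # number of rows per value
shares = survived.value_counts(normalize=True)  # fraction of rows per value

print(counts)
print(shares)
```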

### Data transformation

**Create new columns:**

We can add new columns to existing DataFrames by assigning new values to a new column...

In [None]:
df

In [None]:
df['Age'] = [19, 18, 20, 22]

In [None]:
df

...or we can create a new column based on an existing column...

In [None]:
df['Score_share'] = df['Score'] / 100

In [None]:
df

...or we can change an existing column (the `astype` function changes the column dtype, for example to `str`, `float` or `int`).

In [None]:
df['Score'] = df['Score'].astype(int)

In [None]:
df

We can use the `apply` function to apply a function to each value in a column.

In [None]:
df['Name_length'] = df['Name'].apply(len)

In [None]:
df
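`apply` also works with functions we write ourselves. The sketch below uses a hypothetical function `grade` with an assumed pass mark of 60 (the threshold is an assumption made for this example only):

```python
import pandas as pd

def grade(score):
    """Hypothetical rule: a score of 60 or more counts as a pass."""
    return 'yes' if score >= 60 else 'no'

df_demo = pd.DataFrame({'Name': ['Oleg', 'Jenny', 'Chang', 'Jonas'],
                        'Score': [65.0, 58.0, 79.0, 95.0]})

# grade is called once for every value in the 'Score' column
df_demo['Pass'] = df_demo['Score'].apply(grade)

print(df_demo)
```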

**Drop rows and columns:**

We can drop rows and columns by specifying label names and corresponding
axis in the `drop` function.

- Setting `axis = 0` drops rows.

- Setting `axis = 1` drops columns.

To drop a row:

In [None]:
df.drop(0, axis = 0)

This does not change the data frame:

In [None]:
df

To change the data frame, you have to specify `inplace = True`:

In [None]:
df.drop(0, axis = 0, inplace = True)

To drop a column, specify the column name and axis = 1:

In [None]:
df

In [None]:
df1 = df.drop('Score_share', axis = 1)

In [None]:
df1

We can drop several columns by passing a *list* of column names to the `drop` function.

Notice that setting the parameter `inplace` equal to `True` permanently drops the column/row from the original data set.

In [None]:
df1 = df.drop(['Score_share', 'Name_length'], axis = 1)

In [None]:
df1

In [None]:
df

In [None]:
df.drop(['Score_share', 'Name_length'], axis = 1, inplace = True)

In [None]:
df
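As an alternative to the `axis` parameter, `drop` also accepts the keyword parameters `index` and `columns`, which some find easier to read. A small sketch on a made-up DataFrame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Oleg', 'Jenny', 'Chang'],
                        'Score': [65, 58, 79],
                        'Age': [19, 18, 20]})

# Same effect as df_demo.drop(['Age'], axis = 1)
no_age = df_demo.drop(columns=['Age'])

# Same effect as df_demo.drop([0], axis = 0)
no_first_row = df_demo.drop(index=[0])

# Without inplace = True, the original is left unchanged
print(df_demo.shape)
```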

**Change index:**

We can change the index of a DataFrame by assigning a list with new index values to the `index` of the DataFrame.

Remember: the index should be unique to every observation! Pandas does not enforce this, but duplicate labels make row selection ambiguous.

In [None]:
df.index = ['A', 'B', 'C']

df

We can get the default numerical index back by using the `reset_index` function. The old index is then kept as a regular column.

In [None]:
df1 = df.reset_index()

In [None]:
df1

Setting the parameters `inplace` and `drop` equal to `True` will reset the index of the original DataFrame, and drop the old index from the DataFrame.

In [None]:
df.reset_index(inplace = True, drop = True)

In [None]:
df
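The reverse operation is `set_index`, which turns an existing column into the index. A minimal sketch, assuming a made-up DataFrame `df_demo` where the names are unique:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Jenny', 'Chang', 'Jonas'],
                        'Score': [58, 79, 95]})

# Use the 'Name' column as the index
by_name = df_demo.set_index('Name')

# Rows can now be selected by name with loc
print(by_name.loc['Chang', 'Score'])

# reset_index moves 'Name' back to a regular column
restored = by_name.reset_index()
print(restored)
```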

### Missing data

Missing data in pandas is denoted by the floating-point value `NaN`, which stands for 'not a number'.

In [None]:
type(np.nan)

In [None]:
df['City'] = ['Bergen', 'Oslo', np.nan]

df

We can count the number of NaN values in each column of a DataFrame by combining the `isna` and `sum` functions.

In [None]:
df.isna()

In [None]:
df.isna().sum()

Sometimes we want to drop rows/columns with missing data from our DataFrame. 

We can do this using the `dropna` function, while specifying which axis we wish to drop from.

In [None]:
df

In [None]:
df.dropna(axis = 0)

In [None]:
df

In [None]:
df.dropna(axis = 1)

Notice that we must set the parameter `inplace` equal to `True` in order to make the changes to the original DataFrame.

In [None]:
df
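Sometimes only some columns matter when deciding which rows to drop. `dropna` has a `subset` parameter for this. The sketch below uses a made-up DataFrame `df_demo` with missing values in two columns:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Jenny', 'Chang', 'Jonas'],
                        'City': ['Bergen', 'Oslo', np.nan],
                        'Score': [58.0, np.nan, 95.0]})

# Drop every row that has a NaN in any column
complete_rows = df_demo.dropna(axis=0)

# Drop only the rows where 'City' is missing; NaN in 'Score' is allowed
has_city = df_demo.dropna(subset=['City'])

print(complete_rows)
print(has_city)
```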

We can fill in NaN with another value using the `fillna` function...

In [None]:
df.fillna('missing')

...or using the `replace` function.

In [None]:
df.replace(np.nan, 'missing')
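Different columns often need different replacement values. One possible approach, sketched here on a made-up DataFrame `df_demo`, is to pass a dictionary to `fillna` that maps column names to fill values (using the column mean for `Score` is only an illustrative choice):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'City': ['Bergen', 'Oslo', np.nan],
                        'Score': [58.0, np.nan, 95.0]})

# Text column gets a placeholder, numeric column gets the column mean
fill_values = {'City': 'unknown', 'Score': df_demo['Score'].mean()}

filled = df_demo.fillna(fill_values)

print(filled)
```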

### Filtering rows

We can select a subset of the rows and columns in the DataFrame based on one or several conditions (boolean expressions).

A boolean expression returns a Series of boolean values, indicating whether the expression was true or false for each observation.

In [None]:
df

In [None]:
above_75 = df['Score'] > 75

In [None]:
above_75

This Series of boolean values can be used to filter the DataFrame by placing the condition inside the selection brackets []. This will select only the rows for which the value is true.

In [None]:
df

In [None]:
df[above_75]

It is more common to place the Boolean expression directly inside the square brackets.

In [None]:
df[df['Score'] > 75]

We can combine filtering rows with column selection.

In [None]:
df[df['Score'] > 75]['Name']
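The same selection can be written in a single step with `loc`, which takes the row condition first and the column name second. This is a sketch on a made-up DataFrame `df_demo`; it avoids indexing twice in a row:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Jenny', 'Chang', 'Jonas'],
                        'Score': [58, 79, 95]})

# Rows where Score > 75, and only the 'Name' column
names_above_75 = df_demo.loc[df_demo['Score'] > 75, 'Name']

print(names_above_75)
```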

Other comparison operators are <, <=, >=, ==, and !=. In addition, we can use the `isin` function to select on multiple values (similar to the membership operator `in`).

In [None]:
df

In [None]:
name_list = ['Chang', 'Jenny']

In [None]:
df[df['Name'].isin(name_list)]

Notice that we can also filter on *multiple* conditions. 

In [None]:
df

In [None]:
cond1 = df['Age'] < 21
cond2 = df['Pass'] == 'yes'

In [None]:
cond2

In [None]:
df

However, each condition must be surrounded by parentheses, and we have to use the operator `|` (the pipe operator) for 'or' and the operator `&` for 'and'.

In [None]:
df[(df['Age'] < 21) & (df['Pass'] == 'yes')]

In [None]:
df[(cond1) & (cond2)]

In [None]:
df[(cond1) | (cond2)]
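In addition to `&` and `|`, the operator `~` negates a condition ('not'). A minimal sketch on a made-up DataFrame `df_demo`, with the conditions stored in the illustrative variables `young` and `passed`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Jenny', 'Chang', 'Jonas'],
                        'Age': [18, 20, 22],
                        'Pass': ['no', 'yes', 'yes']})

young = df_demo['Age'] < 21
passed = df_demo['Pass'] == 'yes'

# ~ flips True and False, so this selects the students who did not pass
not_passed = df_demo[~passed]

# Conditions can be combined freely: young AND passed
young_and_passed = df_demo[young & passed]

print(not_passed)
print(young_and_passed)
```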