# III - Data Analysis

## III.2. Pandas 🐼

---

![](images/red-panda.jpg)

> Pandas is a **Python library** made for **data manipulation and analysis**. Even though it's quite a young library, it has proven very useful to data scientists!

![](images/pandas_logo.png)

> 📚 **Resources**: This *cheatsheet* might come in handy for remembering essential methods: [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

---

# I. Introduction to Pandas

## I.1. DataFrames

Let's first import the library. By convention, we use the alias `pd`, which keeps calls to the library short.

In [None]:
import pandas as pd

The main object, and the core strength of Pandas, is the `DataFrame`.

It is a tabular structure made of rows and columns.

We can create a DataFrame manually, or we can load existing data (from .csv files, for example) directly into it.

In [None]:
# We can create a DataFrame manually
df = pd.DataFrame({"name": ["Einstein", "Turing", "Nash"],
                   "birth_year": [1879, 1912, 1928]})
df

In [None]:
# We can also create a DataFrame by loading the data directly from a .csv file, for example
df = pd.read_csv('input/imdb_1000.csv')

Once the data is loaded, you can visualize and manipulate it quite easily: Pandas provides many powerful methods for this new object.

You can check the content of the first `n` rows:

In [None]:
# Now you can take a peek at the first rows (5 by default)
df.head()

Or display the last `n` rows:

In [None]:
df.tail(n=8)

As you can see, a DataFrame is a richer object than a raw NumPy array. It is therefore more powerful, but also more expensive (in memory and computing time).

Indeed, a DataFrame contains more information (metadata) than just the raw data:
- it references the **column names**
- it references the table **index** (or several if you work with "multi-indexes"), displayed in bold on the left. It identifies each row
- it references null (N/A or empty) values

By default, when creating a new DataFrame, Pandas instantiates a new incremental index, but you can also specify a custom index.
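
As a minimal sketch of what a custom index changes, here is a tiny DataFrame built by hand. The data is a toy example (not the IMDB file used in this notebook), and the variable names `scientists` and `default_index` are just illustrative:

```python
import pandas as pd

# Toy data for illustration only
scientists = pd.DataFrame(
    {"birth_year": [1879, 1912, 1928]},
    index=["Einstein", "Turing", "Nash"],  # custom index instead of 0, 1, 2
)

# Rows can now be looked up by their label
print(scientists.index.tolist())
print(scientists.loc["Turing", "birth_year"])

# reset_index() brings back the default incremental index
# and moves the old index into a regular column
default_index = scientists.reset_index()
print(default_index.index.tolist())
```

With a custom index, the labels replace the row numbers in the display and can be used for lookups, which is what the `title` index below makes possible.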

In [None]:
# METHOD 1: Reading csv and indicating index column
df = pd.read_csv('input/imdb_1000.csv', index_col="title")

# METHOD 2: Reading csv then setting index manually
#df = pd.read_csv('input/imdb_1000.csv')
#df = df.set_index("title")

df.head()

In [None]:
# METHOD 2: Reading csv then setting index manually
df = pd.read_csv('input/imdb_1000.csv')
df = df.set_index("title")

df.head()

Again, the DataFrame class contains many useful methods and attributes to retrieve information about the data (metadata):

In [None]:
# `shape` attribute to retrieve (n_rows, n_columns)
df.shape

In [None]:
# `info` method to retrieve a summary of the columns metadata
# (non-null objects, column name, dtype, etc.)
df.info()

We can also retrieve the column names and the row index:

In [None]:
# `columns` attribute
df.columns

In [None]:
# `index` attribute
df.index

## I.2. Data statistics

We can quickly retrieve data statistics with the `describe` method:

In [None]:
# `describe` method allows you to retrieve statistics per column
# (only for columns with numerical values)
df.describe()

We could also retrieve any of those values manually.

In [None]:
# Mean value
print('Mean duration:', df['duration'].mean())
# Min and Max values
print('Min and Max star_rating:', df['star_rating'].min(), df['star_rating'].max())

We can also:
- get the number of unique values of a column using `nunique`
- get the list of those unique values using `unique`
- get the number of rows per value using `value_counts`

> 🔦 **Hint**: This will be particularly useful for **categorical variables**!

In [None]:
# The number of unique values in a column
df['genre'].nunique()

In [None]:
# The unique values of a column
df['genre'].unique()

In [None]:
# The number of rows for each value is the following
df['genre'].value_counts()

---

# II. DataFrame manipulation

## II.1. One column: Series

If we want to retrieve one column, we can specify its name in brackets.

> ⚠️ **Warning**: Given a DataFrame `df` and a column named `col`, you can access this column using `df["col"]` but also using the attribute syntax `df.col`. However, this second notation does not work:
- when you create a new column (while the bracket notation does)
- when the name of the column contains spaces
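
To make the warning concrete, here is a small sketch on a hypothetical toy DataFrame (the column names and values are made up for the example):

```python
import warnings

import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"genre": ["Drama", "Crime"], "star rating": [9.3, 9.2]})

# Reading an existing column: both notations return the same Series
same = toy.genre.equals(toy["genre"])
print(same)

# A column whose name contains a space is only reachable with brackets
ratings = toy["star rating"]

# Creating a column works with brackets...
toy["duration"] = [142, 175]

# ...but attribute assignment only sets a plain Python attribute
# (pandas emits a UserWarning, silenced here), no column is created
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.year = [1994, 1972]

print(list(toy.columns))
```

When in doubt, the bracket notation is the safer choice since it works in every case.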

In [None]:
df['content_rating']

What is the type of this column?

In [None]:
type(df['content_rating'])

Interesting. So in summary in Pandas' world:
- a table is a **Pandas DataFrame**
- a column of a DataFrame is a **Pandas Series**.

Again, a Pandas Series contains more information than just the raw data, such as the row index.

You can therefore retrieve:
- the raw data using the `values` attribute... back to a good old NumPy array 🙂
- the row index using the `index` attribute

In [None]:
df['content_rating'].values

Let's check the type to be sure:

In [None]:
print(type(df['content_rating'].values))

## II.2. Two or more columns: DataFrame

Now what if we want to collect multiple columns at the same time?

In [None]:
# The syntax is slightly different this time: two pairs of brackets are needed, because we pass a list of columns
df[["duration", "genre"]]

Therefore, selecting multiple columns from a DataFrame gives back a DataFrame (it is simply the original DataFrame with fewer columns).
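
The difference between single and double brackets can be checked on a small example. This is a sketch on hypothetical toy data, with illustrative variable names:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {"genre": ["Drama", "Crime"], "duration": [142, 175], "star_rating": [9.3, 9.2]}
)

one_col = toy["duration"]              # single brackets -> Series
one_col_df = toy[["duration"]]         # list with one name -> one-column DataFrame
two_cols = toy[["duration", "genre"]]  # list with two names -> DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__, one_col_df.shape)
print(type(two_cols).__name__, two_cols.shape)
```

Note that passing a list always returns a DataFrame, even when the list contains a single column name.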

## II.3. Selecting data

We just explained how to select a subset of one or more columns from a DataFrame.

What if we want to select specific rows also? Let's explore the different methods and tools to **select** specific data.

There are two main ways to do it: by using **indices** (positions) with `iloc` or by using **keys** (labels) with `loc`.

### II.3.A. `iloc`

Since a DataFrame is like a table, you might want to select data using **indices** (0 for the first row, 1 for the second row, -1 for the last row, etc.).

In order to do so, you can use the `iloc` indexer (note the square brackets: it is not called like a regular method).

In [None]:
# Rows:
df.iloc[0] # First row of DataFrame
df.iloc[1] # Second row of DataFrame
df.iloc[-1] # Last row of DataFrame

# Columns:
df.iloc[:,0] # First column of DataFrame
df.iloc[:,1] # Second column of DataFrame
df.iloc[:,-1] # Last column of DataFrame

In addition, you can select specific columns, either by passing column indices inside the `iloc` brackets or by using column keys on the result of `iloc`.
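
Both options can be compared side by side. The sketch below uses hypothetical toy data (made-up titles and values), not the IMDB dataset:

```python
import pandas as pd

# Hypothetical toy data (made-up titles and values)
toy = pd.DataFrame(
    {
        "star_rating": [9.3, 9.2, 9.1],
        "genre": ["Crime", "Drama", "Action"],
        "duration": [142, 175, 200],
    },
    index=["Movie A", "Movie B", "Movie C"],
)

# Option 1: positions for both rows and columns, inside iloc
by_position = toy.iloc[:2, [1, 2]]

# Option 2: positions for rows inside iloc, then column names outside
by_name = toy.iloc[:2][["genre", "duration"]]

# Both selections contain the same rows and columns
print(by_position.equals(by_name))
```

The second form is often easier to read, since it does not require remembering the position of each column.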

In [None]:
# Multiple row and column selections using iloc and DataFrame
df.iloc[0:5] # First five rows of dataframe
df.iloc[:, 0:2] # First two columns of data frame with all rows
df.iloc[[0,3,6,24], [0,2]] # 1st, 4th, 7th, 25th row + 1st and 3rd columns
df.iloc[:5, 0] # First 5 rows and first column of DataFrame

In [None]:
# Getting actors_list of first row
# METHOD 1
print(df.iloc[0,4])

# METHOD 2 - as df.iloc[0] is a Series, you can also get the actors_list like this
print(df.iloc[0]["actors_list"])

### II.3.B. `loc`

Now if you prefer to select data using keys, i.e. labels (for rows or columns or both), then you can use `loc` instead.
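
Before applying it to the movies dataset, here is a self-contained sketch on hypothetical toy data. One detail worth knowing: a label slice with `loc` includes its end label, whereas a position slice with `iloc` excludes its end position.

```python
import pandas as pd

# Hypothetical toy data (made-up titles and values)
toy = pd.DataFrame(
    {"genre": ["Crime", "Drama", "Action"], "duration": [142, 175, 200]},
    index=["Movie A", "Movie B", "Movie C"],
)

one_row = toy.loc["Movie B"]                          # one row -> Series
one_cell = toy.loc["Movie B", "genre"]                # one cell -> scalar
from_b = toy.loc["Movie B":, ["genre", "duration"]]   # from "Movie B" to the end

label_slice = toy.loc["Movie A":"Movie B"]  # end label included -> 2 rows
position_slice = toy.iloc[0:1]              # end position excluded -> 1 row

print(one_cell)
print(len(label_slice), len(position_slice))
```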

In [None]:
# If you prefer to use the keys, then you need to use loc:
df.loc["The Usual Suspects"]

In [None]:
df.loc["The Usual Suspects", "actors_list"]

In [None]:
df.loc["Blue Valentine":, ["genre", "duration"]]

### II.3.C. Filtering data using boolean conditions

An extremely powerful technique, which we introduced with NumPy arrays, consists of **filtering data based on a boolean condition**.

You just need to combine the `loc` indexer described above with a boolean condition.
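
As a self-contained sketch of the idea, the example below filters a hypothetical toy DataFrame. Combining two conditions with `&` goes slightly beyond the cells that follow and is shown only as an illustration; each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy data (made-up titles and values)
toy = pd.DataFrame(
    {"genre": ["Drama", "Crime", "Drama"], "duration": [120, 175, 190]},
    index=["Movie A", "Movie B", "Movie C"],
)

# A comparison on a column returns a boolean Series (one value per row)
is_drama = toy["genre"] == "Drama"
print(is_drama.tolist())

# loc keeps only the rows where the mask is True
dramas = toy.loc[is_drama]

# Conditions can be combined with & (and) or | (or),
# and a list of columns can be selected at the same time
long_dramas = toy.loc[is_drama & (toy["duration"] > 150), ["duration"]]
print(long_dramas)
```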

In [None]:
# Let's retrieve only "Drama" movies
drama_movies_bool = df['genre'] == "Drama"
drama_movies_bool.head()

In [None]:
drama_movies = df.loc[drama_movies_bool]
drama_movies.head()

Again, in one line this time, let's retrieve the names and associated ratings of all movies with a score greater than 9:

In [None]:
top_movies = df.loc[df["star_rating"] > 9, ["star_rating"]]
top_movies

---

# III. Data modification

## III.1. Adding/removing columns & rows

We can add and remove column(s) quite easily with Pandas. Let's create three new columns:
- **`type`**: containing the string "movie" in every row
- **`long_movie`**: indicating whether a movie lasts more than 160 minutes
- **`main_actor`**: containing the first actor listed in `actors_list`

In [None]:
df.head()

In [None]:
import ast

# 1. Column with a fixed value
df["type"] = "movie"

# 2. Column based on another column
df["long_movie"] = df["duration"] > 160

# 3. Column created with a lambda function
# actors_list stores each list as a string, so we need to parse it first;
# ast.literal_eval is a safer alternative to eval for this
df["main_actor"] = df["actors_list"].apply(lambda x: ast.literal_eval(x)[0])

df.head()

The **`type`** column is quite useless, let's remove it.

In [None]:
df = df.drop(['type'], axis=1) # axis=1 corresponds to column, axis=0 corresponds to row
df.head()

## III.2. Handling missing values

Now we know how to create and handle DataFrames, but why are they so helpful for data scientists? Well, let's see how they help us handle missing values in a new dataset:

In [None]:
# Let's import a new dataset
new_df = pd.read_csv('input/class-grades.csv')
new_df.head(10)

Okay we have some **`?`** values, but what does **`info()`** tell us about it?

In [None]:
# Unfortunately, it doesn't tell us a lot about it!
new_df.info()

We can fix this by reloading the file with the following command, which replaces **`?`** values with **`NaN`**:

In [None]:
new_df = pd.read_csv('input/class-grades.csv', na_values=['?'])

In [None]:
# Now, we can check that the number of non-null values has been properly updated!
new_df.info()

So there are **`NaN`** values in our table. We can choose to remove them, for example:

In [None]:
# Well we have some NaN in our data
# A naive approach would be to remove them
drop_df = new_df.dropna()
drop_df.info()
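
The whole sequence can be reproduced without any file. The sketch below reads a small CSV from an in-memory string; the column names and values are hypothetical and do not come from `class-grades.csv`:

```python
import io

import pandas as pd

# Hypothetical CSV content with "?" marking missing grades
csv_text = "Assignment,Final\n57.14,?\n95.05,62.5\n?,51.48\n"

# Without na_values, "?" is kept as text and nothing is counted as missing
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.isna().sum().sum())

# With na_values, "?" becomes NaN and the columns are parsed as floats
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
print(clean.dtypes)
print(clean.isna().sum())

# dropna() removes every row that contains at least one NaN
complete_rows = clean.dropna()
print(len(clean), "->", len(complete_rows))
```

Dropping rows is simple but can discard a lot of data: here, two rows out of three are lost because each misses a single value.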

---

## III.3. Concatenation and merging

### III.3.A. `concat()`

In [None]:
# Let's create two series
s1 = pd.Series(['apple', 'orange', 'banana'],
               index=[1, 2, 4])
s2 = pd.Series(['pineapple', 'wildberry', 'raspberry'],
               index=[3, 2, 6])

print(s1)
print(s2)

And let's try to concatenate them:

In [None]:
# What if we concatenate them?
pd.concat([s1, s2], axis=0)

Okay, that worked, but notice that the resulting index is not sorted and that the label `2` now appears twice, so be careful when you concatenate!

Here we made the concatenation along axis 0 (i.e. the first axis, the rows). We can also concatenate along axis 1 (i.e. the columns):
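
If the original labels do not matter, one possible way to avoid these surprises is to rebuild the index while concatenating. The sketch below reuses the two Series defined above; `ignore_index` and `sort_index` are shown as options, not as the only way to handle this:

```python
import pandas as pd

s1 = pd.Series(['apple', 'orange', 'banana'], index=[1, 2, 4])
s2 = pd.Series(['pineapple', 'wildberry', 'raspberry'], index=[3, 2, 6])

# Plain concatenation keeps the original labels, duplicates included
stacked = pd.concat([s1, s2], axis=0)
print(stacked.index.tolist())
print(stacked.index.is_unique)

# Looking up a duplicated label returns several rows
print(stacked.loc[2].tolist())

# Option 1: discard the labels and renumber from 0
renumbered = pd.concat([s1, s2], axis=0, ignore_index=True)
print(renumbered.index.tolist())

# Option 2: keep the labels but sort them
ordered = stacked.sort_index()
print(ordered.index.tolist())
```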

In [None]:
# What if we concatenate them along the columns this time?
pd.concat([s1, s2], axis=1)

In this case, the index was used to align the two Series: rows are matched by label, and any label present in only one Series gets a NaN in the other column. Again, be careful when you concatenate, you can easily end up with NaN values!
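
To see where the NaN values come from, and one way to avoid them, here is a short sketch reusing the same two Series. The `join="inner"` option keeps only the labels present in both inputs:

```python
import pandas as pd

s1 = pd.Series(['apple', 'orange', 'banana'], index=[1, 2, 4])
s2 = pd.Series(['pineapple', 'wildberry', 'raspberry'], index=[3, 2, 6])

# Default behavior: keep every label found in either Series
side_by_side = pd.concat([s1, s2], axis=1)
print(side_by_side.shape)
print(side_by_side.isna().sum().sum())  # number of NaN cells

# join="inner": keep only the labels found in both Series
common = pd.concat([s1, s2], axis=1, join="inner")
print(common)
```

Only the label `2` exists in both Series, so the inner join returns a single row without any NaN.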

### III.3.B. `merge()`

In [None]:
# Let's create two new dataframes
meal = ['pizza', 'pasta', 'burger']
prices = [11.8, 12.9, 15.60]
calories = [870, 790, 950]

prices = pd.DataFrame({'meal': meal, 'prices': prices})
calories = pd.DataFrame(calories, index=meal, columns=['calories'])

In [None]:
prices

In [None]:
calories

In [None]:
# We can merge them in a smart way:
prices.merge(calories, left_on='meal', right_index=True)
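
The following self-contained sketch shows what this merge does when the two tables do not match perfectly. The extra `salad` row is a hypothetical addition for illustration; by default `merge` performs an inner join, and `how="left"` is one way to keep unmatched rows:

```python
import pandas as pd

# Same idea as above, plus a hypothetical meal with no calorie information
prices = pd.DataFrame({'meal': ['pizza', 'pasta', 'burger', 'salad'],
                       'prices': [11.8, 12.9, 15.60, 9.5]})
calories = pd.DataFrame({'calories': [870, 790, 950]},
                        index=['pizza', 'pasta', 'burger'])

# The "meal" column of prices is matched against the index of calories.
# Default (inner join): meals missing from one table are dropped
inner = prices.merge(calories, left_on='meal', right_index=True)
print(inner)

# how="left": keep every row of prices, with NaN where calories are unknown
left = prices.merge(calories, left_on='meal', right_index=True, how='left')
print(left)
```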