# Week 6: Pandas overview

First, let's import the `pandas` module, together with `numpy` and `os`, which we use below.

In [None]:
import pandas as pd
import numpy as np
import os
pd.options.display.float_format = '{:,.2f}'.format

### Data Series

In [None]:
s1 = pd.Series(data=np.arange(100, 200, 5))
s1

In [None]:
s2 = pd.Series(data=['data_{}'.format(i) for i in range(20)])
s2

## Data Frames

**Exercise:** create a Data Frame using the two Series above, placing `s2` in the first column. Name the `s2` column `titles` and the `s1` column `numbers`.

In [None]:
# Write your solution here


### Import Data Frame from CSV file (IMDB movies)

In [None]:
rel_path = os.path.join("..","datasets","imdb-movies.csv")
abs_path = os.path.abspath(rel_path)
print(f"Relative path is {rel_path}")
print(f"Absolute path is {abs_path}")

In [None]:
# Get info about the current working directory
os.getcwd()

In [None]:
imdb_movies = pd.read_csv(abs_path)

In [None]:
print(imdb_movies.ndim)
print(imdb_movies.shape)
imdb_movies.head()

In [None]:
# `get_dtype_counts()` was removed in pandas 1.0; count the dtypes directly instead
print(imdb_movies.dtypes.value_counts())
print(type(imdb_movies.values))
print(type(imdb_movies.values.tolist()))
imdb_movies.describe()

In [None]:
imdb_movies.index

In [None]:
old_cols = imdb_movies.columns.values
imdb_movies.columns = [col.lower() for col in imdb_movies.columns]
imdb_movies.columns

Use the `df.dtypes` attribute to see the type of each column.
Remember that a column of `object` type in Pandas can hold strings, lists, dictionaries, etc.

In [None]:
imdb_movies.dtypes
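
To make the remark about the `object` type concrete, here is a minimal sketch on a made-up frame (the name `mixed` and its contents are purely illustrative and not part of the IMDB dataset). The column reports a single `object` dtype even though each cell holds a different Python type.

```python
import pandas as pd

# Hypothetical toy data: one column holding a string, a list and a dictionary
mixed = pd.DataFrame({"payload": ["text", [1, 2, 3], {"key": "value"}]})

# The dtype reported for the whole column
payload_dtype = mixed["payload"].dtype
print(payload_dtype)

# The Python type actually stored in each individual cell
cell_types = mixed["payload"].map(type).tolist()
print(cell_types)
```

This is why an `object` column often deserves a closer look before you apply string or numerical operations to it.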

### Indexing and Re-indexing

Let's set the column `Rank` as the index of the DataFrame. We first restore the original column names, since the following cells refer to them.

In [None]:
# Restore the original (capitalised) column names saved earlier
imdb_movies.columns = old_cols
imdb_movies.set_index("Rank")

In [None]:
imdb_movies.head()

Nothing has happened... why? Because the DataFrame is not modified in place. Rather, a new DataFrame is returned. If we want to modify the input DataFrame we must do:

In [None]:
imdb_movies.set_index("Rank", inplace=True)
imdb_movies.head()
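
The same behaviour can be checked on a small frame. The sketch below uses a hypothetical `toy` DataFrame (not the IMDB data) and shows that reassigning the returned value is an alternative to `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the behaviour
toy = pd.DataFrame({"Rank": [1, 2, 3], "Title": ["A", "B", "C"]})

# Without reassignment the result is a new object and `toy` is unchanged
returned = toy.set_index("Rank")
print(toy.index.name)       # the original frame still has an unnamed index
print(returned.index.name)  # the returned frame is indexed by "Rank"

# Reassigning the result has the same effect as `inplace=True`
toy = toy.set_index("Rank")
print(toy.index.name)
```

Which of the two styles to use is largely a matter of taste; reassignment makes it easier to chain several operations.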

## Rows selection

In [None]:
imdb_movies[10:20]

In [None]:
# select every fifth row, starting at position 4 and stopping before position 100
imdb_movies[4:100:5]

## Column selection

In [None]:
actors = imdb_movies["Actors"]
type(actors)

If you select a single column, you are returned a Pandas Series, which is a 1-D version of a DataFrame, just as a vector is a 1-D version of a matrix, if you wish.

In [None]:
details = imdb_movies[["Title", "Genre", "Director"]]
print(type(details))
details.head()

## Data Selection

In [None]:
## Selection by index
imdb_movies.loc[5]

In [None]:
## Selection by position of the row
imdb_movies.iloc[4]

In [None]:
## Selection by index with filtering
imdb_movies.loc[[5, 10, 15], ["Title", "Year"]]

In [None]:
## Selection by position with filtering
imdb_movies.iloc[[4, 9, 14], [0, 3]]
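
The difference between `.loc` (labels) and `.iloc` (positions) is easiest to see when the index does not start at 0. The following sketch assumes a hypothetical `toy` frame whose index starts at 1, as a rank column typically does.

```python
import pandas as pd

# Hypothetical toy frame whose index labels start at 1, like a rank column
toy = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Year": [2001, 2002, 2003]},
    index=pd.Index([1, 2, 3], name="Rank"),
)

by_label = toy.loc[2]      # the row whose index label is 2
by_position = toy.iloc[2]  # the third row, because positions start at 0

print(by_label["Title"])
print(by_position["Title"])
```

The same number therefore selects two different rows, which is why label 5 and position 4 point at the same film in the cells above.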

### Filtering Pandas DataFrame

In [None]:
low_rating = imdb_movies.Rating < 6.0
low_rating.head()

In [None]:
imdb_movies[low_rating]

**Exercise:** filter the DataFrame to find all films with a runtime between 90 and 140 minutes.

In [None]:
# Write your solution here


**Exercise:** now find all the films shorter than 90 minutes or longer than 140.

In [None]:
# Write your solution here


Let's find all movies directed by three famous directors: Christopher Nolan, Martin Scorsese and Peter Jackson.

In [None]:
director_cond_00 = (
    (imdb_movies.Director == 'Christopher Nolan') | 
    (imdb_movies.Director == 'Martin Scorsese') | 
    (imdb_movies.Director == 'Peter Jackson')
)

In [None]:
imdb_movies[director_cond_00]

This can also be done by using the `pd.Series.isin()` method:

In [None]:
# rewrite the condition above using `.isin()`
director_cond_01 = ...

In [None]:
imdb_movies[director_cond_01]

In [None]:
# You can achieve the same result using the `.loc` indexer
imdb_movies.loc[director_cond_01]
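
As a generic illustration of how `.isin()` behaves, here is a minimal sketch on unrelated, made-up data (the `fruits` Series is hypothetical). It returns a boolean Series that is `True` wherever the element matches one of the listed values.

```python
import pandas as pd

# Hypothetical toy Series, unrelated to the IMDB data
fruits = pd.Series(["apple", "pear", "kiwi", "apple", "plum"])

# True where the element is one of the listed values
is_selected = fruits.isin(["apple", "plum"])
print(is_selected.tolist())

# The mask can be used for filtering like any other boolean Series
selected = fruits[is_selected]
print(selected.tolist())
```

Compared with chaining several `==` comparisons with `|`, this stays readable as the list of accepted values grows.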

## IsNull, IsNa, NotNa, DropNa, FillNa

These methods are useful for finding and handling missing values in your dataset.

In [None]:
is_missing_revenue = imdb_movies['Revenue (Millions)'].isnull()
imdb_movies[is_missing_revenue]

In [None]:
is_missing_revenue = imdb_movies['Revenue (Millions)'].isna()
imdb_movies[is_missing_revenue]

In [None]:
with_revenue = imdb_movies['Revenue (Millions)'].notna()
imdb_movies[with_revenue]

Using `.dropna()`, rows or columns with missing values can be dropped.

Using `.fillna()`, missing values can be replaced with specific values. This is called imputation. A common approach is to use the mean or the median value for numerical features. What about categorical ones?
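
A minimal sketch of both methods, assuming a hypothetical `toy` frame with made-up values. Filling the categorical column with its most frequent value is only one possible choice among several, shown here as a starting point for the question above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical column
toy = pd.DataFrame({
    "revenue": [10.0, np.nan, 30.0, 50.0],
    "genre": ["Drama", "Comedy", None, "Drama"],
})

# Drop every row that contains at least one missing value
complete_rows = toy.dropna()

# Impute the numerical column with its median
revenue_filled = toy["revenue"].fillna(toy["revenue"].median())

# One possible choice for a categorical column: the most frequent value
genre_filled = toy["genre"].fillna(toy["genre"].mode()[0])

print(complete_rows)
print(revenue_filled.tolist())
print(genre_filled.tolist())
```

Note that neither call modifies `toy`; each returns a new object, just like `set_index`.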

### Sorting results

In [None]:
movies_with_revenue = imdb_movies[with_revenue]
movies_with_revenue.sort_values(by=['Revenue (Millions)'])

In [None]:
movies_with_revenue = imdb_movies[with_revenue]
movies_with_revenue.sort_values(by=['Rating', 'Votes'])

In [None]:
movies_with_revenue = imdb_movies[with_revenue]
movies_with_revenue.sort_values(by=['Revenue (Millions)'], ascending=False)

### Concatenating two DataFrames

In [None]:
books_0 = [
    ('Tolkien', 'The Hobbit', 1937, 1220),
    ('Tolkien', 'The Lord of the Rings', 1966, 1220),
    ('Rowling', 'Harry Potter and the Goblet of Fire', 2007, 660),
    ('Rowling', 'Harry Potter and the Deathly Hallows', 2007, 660),
    ('Rowling', 'Fantastic Beasts and where to Find Them', 2007, 660),
]

books_1 = [
    ('James', 'The Turn of the Screw', 1898, 121),
    ('Pynchon', 'The Crying of Lot 49', 1966, 152),
    ('Simmons', 'Hyperion', 1989, 500)
]

In [None]:
books_0_df = pd.DataFrame.from_records(
    data=books_0,
    columns=['author', 'title', 'publication_year', 'page_count']
)
books_1_df = pd.DataFrame.from_records(
    data=books_1,
    columns=['author', 'title', 'publication_year', 'page_count']
)

In [None]:
books_0_df

In [None]:
books_1_df

We want to concatenate the two Data Frames vertically, i.e. stack the rows of one below the rows of the other.

In [None]:
# complete this replacing the ellipsis
books_df = pd.concat(...)
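
For reference, the general pattern of `pd.concat` looks like the sketch below. It uses two hypothetical frames (`first` and `second`) with made-up values, not the book data, so that the exercise above is left to you.

```python
import pandas as pd

# Hypothetical toy frames sharing the same columns
first = pd.DataFrame({"item": ["a", "b"], "value": [1, 2]})
second = pd.DataFrame({"item": ["c"], "value": [3]})

# axis=0 (the default) stacks rows; ignore_index=True rebuilds a 0..n-1 index
stacked = pd.concat([first, second], axis=0, ignore_index=True)
print(stacked)
```

Without `ignore_index=True` the original index labels are kept, so the result here would contain the label 0 twice.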

### Merge two DataFrames

In [None]:
authors = [
    ('Thomas', 'Pynchon', 1936, None),
    ('J.K.', 'Rowling', 1965, None),
    ('J.R.R.', 'Tolkien', 1892, 1973),
    ('James', 'Joyce', 1882, 1941)
]

authors_df = pd.DataFrame.from_records(
    data=authors,
    columns=['name', 'surname', 'birth_year', 'death_year']
)

In [None]:
books_df

In [None]:
authors_df

We want to combine rows from the two DataFrames (`books_df` and `authors_df`), based on a related column between them.

What are the related columns here?

In [None]:
# complete this filling the ellipsis
# Have a look at the merge() function documentation
pd.merge(...)
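
As a hint on how `pd.merge` behaves when the key columns have different names, here is a minimal sketch on hypothetical frames (`orders` and `customers`, with made-up values). The `left_on`, `right_on` and `how` parameters are part of the real `merge` signature; everything else is illustrative.

```python
import pandas as pd

# Hypothetical toy frames; the key columns have different names on purpose
orders = pd.DataFrame({"customer": ["ann", "bob", "ann"], "amount": [5, 7, 9]})
customers = pd.DataFrame({"nickname": ["ann", "cat"], "country": ["IT", "FR"]})

# An inner join (the default) keeps only keys present in both frames
inner = pd.merge(orders, customers, left_on="customer", right_on="nickname")

# A left join keeps every row of `orders`, filling the gaps with NaN
left = pd.merge(orders, customers, how="left",
                left_on="customer", right_on="nickname")

print(inner)
print(left)
```

The choice of `how` decides what happens to rows without a match, which matters for the book and author data because not every author appears in both frames.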

### Import a JSON dataset

In [None]:
file_path = "../datasets/walking-dead-tv-series.json"
temp_df = pd.read_json(file_path)
temp_df.head()

<b>Data preprocessing:</b>

**Exercise:**

* The `_links` are unstructured and stored as a dictionary. How can we extract them as strings in a new column?
* The runtime is in minutes; convert it to seconds.

In [None]:
# Write your solution here...