# Pandas Exercises

We will be working with a dataset from the [Internet Movie Database](https://www.imdb.com). The dataset comes from [Kaggle](https://www.kaggle.com/PromptCloudHQ/imdb-data/data). Since you need an account there to download it, the data file is provided in ILIAS for your convenience. The copied file was slightly altered to make this exercise more interesting.

Import NumPy and pandas using the usual aliases.

Click on the three dots below to display the solution

In [None]:
import numpy as np
import pandas as pd

Load the file into a DataFrame named df.

In [None]:
# df = ...

Click on the three dots below to display the solution

In [None]:
df = pd.read_csv('IMDB-Movie-Data.csv')

Show the dimensions (the shape) of the dataframe

Click on the three dots below to display the solution

In [None]:
df.shape

Now display the first 10 rows of the dataframe

Click on the three dots below to display the solution

In [None]:
df.head(n=10)

Also check the last 5 entries.

Click on the three dots below to display the solution

In [None]:
df.tail()

The last row does not contain any data. Remove it.

Click on the three dots below to display the solution

In [None]:
# any of the following works (but execute only one of them)
df = df.iloc[:-1, :] # select all rows except the last by position
#df = df.loc[:999, :] # select all rows except the last by index
#df.drop(1000, axis='index', inplace=True) # drop last row inplace
#df = df.drop(1000) # drop last row and assign resulting dataframe to df
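To convince yourself that these alternatives really select the same rows, you can try them on a small frame first. The sketch below uses a made-up three-row frame called `toy` (not the exercise data) whose last row is empty. It also shows `dropna(how='all')`, which removes every completely empty row without you having to know its position.

```python
import pandas as pd

# Hypothetical toy frame; the last row is empty like in the exercise file
toy = pd.DataFrame({'Title': ['A', 'B', None], 'Year': [2010.0, 2011.0, None]})

by_position = toy.iloc[:-1, :]   # everything except the last row, by position
by_label = toy.loc[:1, :]        # label slices with loc include the end label
dropped = toy.drop(2)            # drop the row with index label 2
no_empty = toy.dropna(how='all') # drop rows where every value is missing

print(by_position.equals(by_label), by_label.equals(dropped), dropped.equals(no_empty))
```

Note that `loc[:1]` keeps the row labelled 1, while `iloc[:1]` would stop before position 1.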

List all the columns of the DataFrame

Click on the three dots below to display the solution

In [None]:
df.columns

And now list the index

Click on the three dots below to display the solution

In [None]:
df.index

List only the Years

Click on the three dots below to display the solution

In [None]:
df.Year # or df['Year'] or df.loc[:, 'Year']

What time span does our data cover?

Click on the three dots below to display the solution

In [None]:
df.Year.min(), df.Year.max()

Now check if the datatypes make sense (list them)

Click on the three dots below to display the solution

In [None]:
df.dtypes

In fact, we can drop the Rank column, as it is just a kind of ID that we don't need.

Click on the three dots below to display the solution

In [None]:
df.drop('Rank', axis='columns', inplace=True)

In [None]:
df.head()

Ok. So the columns Genre and Actors contain multiple values. Just like with databases, we want our DataFrame to adhere to the [1st Normal Form](https://en.wikipedia.org/wiki/First_normal_form) and thus need to separate these columns into their own DataFrames or Series.

First use the .str accessor and its split() method to create a Series where each element is a list of genres.

In [None]:
# genres = ...
# genres.head()

Click on the three dots below to display the solution

In [None]:
genres = df.Genre.str.split(',')
genres.head()

You now have a Series of lists. The following code generates a mapping table that has one entry per movie/genre combination. Try to understand the statement by taking it apart, executing the commands one-by-one and reading their documentation.

In [None]:
genres = genres.apply(pd.Series).stack().reset_index().iloc[:, [0,2]]
genres.columns = ['movie_id', 'genre']
genres.head()
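If you got stuck taking the statement apart, here is one possible breakdown on a tiny made-up Series (`toy_lists` and all other names in this cell are hypothetical and do not touch the exercise data). The explicit `dropna()` is an addition: the pandas versions this was written against drop the NaN padding inside `stack()`, but stating it explicitly keeps the result the same if your version behaves differently.

```python
import pandas as pd

# Hypothetical toy data standing in for the result of df.Genre.str.split(',')
toy_lists = pd.Series([['Action', 'Sci-Fi'], ['Drama'], ['Comedy', 'Drama', 'Romance']])

# Step 1: expand each list into one row of a wide DataFrame (short lists are padded with NaN)
wide = toy_lists.apply(pd.Series)

# Step 2: stack the columns into one long Series indexed by (row label, list position)
long = wide.stack().dropna()

# Step 3: turn both index levels into ordinary columns
flat = long.reset_index()

# Step 4: keep the row label (column 0) and the value (column 2), drop the list position
toy_genres = flat.iloc[:, [0, 2]]
toy_genres.columns = ['movie_id', 'genre']
print(toy_genres)

# Alternative sketch: Series.explode() yields one row per list element directly
alt = toy_lists.explode().reset_index()
alt.columns = ['movie_id', 'genre']
print(alt)
```

Both variants should give one row per movie/genre pair, with `movie_id` repeating the original row label.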

Now do the same for the actors.

Click on the three dots below to display the solution

In [None]:
actors = df.Actors.str.split(',').apply(pd.Series).stack().reset_index().iloc[:, [0,2]]
actors.columns = ['movie_id', 'actor']
actors["actor"] = actors.actor.str.strip()  # needed to remove leading whitespace, which would otherwise duplicate actors
actors.head()
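To see why the `strip()` matters, here is a small sketch with made-up names (`toy_actors` is hypothetical). It assumes the names are separated by a comma followed by a space, so splitting on `','` alone leaves a leading space on every name except the first.

```python
import pandas as pd

# Hypothetical example rows; the names are invented for illustration
toy_actors = pd.Series(['Ann Lee, Bob Ray', 'Bob Ray, Cy Poe'])

parts = toy_actors.str.split(',').explode()

# Without strip(), 'Bob Ray' and ' Bob Ray' are counted as two different actors
unique_raw = parts.nunique()
unique_clean = parts.str.strip().nunique()
print(unique_raw, unique_clean)
```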

We don't need the original columns Genre and Actors from the df DataFrame anymore. Drop them.

Click on the three dots below to display the solution

In [None]:
df.drop(['Genre', 'Actors'], axis=1, inplace=True)

Ok, so now our three DataFrames are ready for some analysis!

In [None]:
df.head(n=3)

In [None]:
actors.head(n=3)

In [None]:
genres.head(n=3)

Which are the movies with the longest and shortest runtime?

Click on the three dots below to display the solution

In [None]:
# with boolean indexing (can also be written in one line):
shortest_runtime = df['Runtime (Minutes)'].min()
df[df['Runtime (Minutes)'] == shortest_runtime]

In [None]:
# with sorting:
df.sort_values(by='Runtime (Minutes)', ascending=False).head(n=1)
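As a third option, `nlargest()` and `nsmallest()` return the top or bottom rows for a column without sorting the whole frame yourself. The sketch below runs on a made-up frame (`toy_runtime` is hypothetical); only the column name is borrowed from the exercise.

```python
import pandas as pd

# Hypothetical toy data; titles and runtimes are made up
toy_runtime = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Runtime (Minutes)': [95, 180, 66],
})

longest = toy_runtime.nlargest(1, 'Runtime (Minutes)')
shortest = toy_runtime.nsmallest(1, 'Runtime (Minutes)')
print(longest)
print(shortest)
```

Unlike the boolean-indexing version, this returns only one row even if several movies share the extreme runtime.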

Which is the best movie, measured by the metascore?

Click on the three dots below to display the solution

In [None]:
df.sort_values(by='Metascore', ascending=False).head(n=1)

Which is the longest description?

Click on the three dots below to display the solution

In [None]:
index_of_longest = df.Description.str.len().sort_values(ascending=False).head(n=1)
df.loc[index_of_longest.index, 'Description'].values

# Alternative solution (note the parentheses: idxmax is a method and must be called)
# df.loc[df.Description.str.len().idxmax(), 'Description']
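The idea behind the alternative is that `idxmax()` returns the index label of the largest value, which you can then pass to `loc`. A minimal sketch on made-up text (`toy_text` is hypothetical, with a non-default index to make clear that a label is returned, not a position):

```python
import pandas as pd

# Hypothetical descriptions with index labels 10, 11, 12
toy_text = pd.Series(['short', 'a much longer description', 'medium one'], index=[10, 11, 12])

lengths = toy_text.str.len()      # number of characters per entry
label = lengths.idxmax()          # index label of the longest entry
print(label, toy_text.loc[label])
```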

What is the average metascore?

Click on the three dots below to display the solution

In [None]:
df.Metascore.mean()

Sort the genres by popularity.

Click on the three dots below to display the solution

In [None]:
genres.genre.value_counts()
# or manually: genres.groupby('genre').count().sort_values('movie_id', ascending=False)
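The two approaches count the same thing. Here is a quick check on a made-up mapping table (`toy_map` is hypothetical, shaped like the genres table above but with invented values and no ties, so the order is unambiguous):

```python
import pandas as pd

# Hypothetical movie/genre mapping table
toy_map = pd.DataFrame({
    'movie_id': [0, 0, 1, 1, 2, 2],
    'genre': ['Drama', 'Action', 'Drama', 'Comedy', 'Drama', 'Action'],
})

# value_counts() counts and sorts in descending order in one step
short_way = toy_map.genre.value_counts()

# The manual route: count rows per group, then sort by that count
manual = toy_map.groupby('genre').count().sort_values('movie_id', ascending=False)['movie_id']

print(short_way)
print(manual)
```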

Which director is the most productive by number of movies?

Click on the three dots below to display the solution

In [None]:
df.Director.value_counts().head()

Which director is the most productive by revenue?

Click on the three dots below to display the solution

In [None]:
df.groupby('Director')['Revenue (Millions)'].sum().sort_values(ascending=False).head()
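One thing to keep in mind when summing per group: `sum()` skips missing values, so a director whose revenues are all missing ends up with a total of 0 rather than NaN. Whether the revenue column in this file has such gaps is something to check yourself (for example with `isna().sum()`); the sketch below only illustrates the behaviour on made-up numbers (`toy_rev` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing revenues
toy_rev = pd.DataFrame({
    'Director': ['X', 'X', 'Y', 'Z'],
    'Revenue (Millions)': [100.0, np.nan, 150.0, np.nan],
})

grouped = toy_rev.groupby('Director')['Revenue (Millions)']
totals = grouped.sum().sort_values(ascending=False)  # NaN values are skipped
counts = grouped.count()                             # number of non-missing revenues per director

print(totals)
print(counts)
```

Looking at `count()` next to `sum()` shows which totals are based on few or no actual revenue figures.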

Which actors have acted most often in the year 2012?

Click on the three dots below to display the solution

In [None]:
joined = actors.join(df, on='movie_id')
joined.loc[joined.Year==2012, 'actor'].value_counts().head()
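The `join` works because `on='movie_id'` matches the values in that column of the calling frame against the index of the other frame. A minimal sketch with made-up frames (`toy_movies` and `toy_cast` are hypothetical stand-ins for df and actors):

```python
import pandas as pd

# Hypothetical movie table; its default index 0, 1 plays the role of movie_id
toy_movies = pd.DataFrame({'Title': ['A', 'B'], 'Year': [2012, 2013]})

# Hypothetical mapping table with one row per movie/actor pair
toy_cast = pd.DataFrame({'movie_id': [0, 0, 1], 'actor': ['Ann', 'Bob', 'Ann']})

# Each row of toy_cast picks up the columns of the movie whose index equals its movie_id
toy_joined = toy_cast.join(toy_movies, on='movie_id')
print(toy_joined)

actors_2012 = toy_joined.loc[toy_joined.Year == 2012, 'actor'].tolist()
print(actors_2012)
```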

Make sure you remember the value_counts() method; you'll use it a lot!

Now, explore the dataset a bit more and think of some questions you can answer.

In [None]:
# ...

### Well done!
If you understood most of the concepts, you are ready for the Machine Learning course.

Below is some additional material, if you want to go on.
* [Pythonchallenge](http://www.pythonchallenge.com/): Learn Python by solving increasingly hard riddles. Fun!
* [Project Euler](https://projecteuler.net/): Learn Python by solving increasingly hard programming problems.
* [More Pandas Exercises](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) (see the Cookbook exercises)
* [More Python Exercises](https://github.com/jerry-git/learn-python3) (these start out easy but cover a lot of the language)