<a href="https://colab.research.google.com/github/Sagi15G/de_python_course/blob/main/pandas_transform_and_visualize_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transform your data

This session assumes you have completed the pandas lessons on Kaggle (https://www.kaggle.com/learn/pandas):
* lesson 1: Creating, Reading and Writing: https://www.kaggle.com/residentmario/creating-reading-and-writing
* lesson 2: Indexing, Selecting & Assigning: https://www.kaggle.com/residentmario/indexing-selecting-assigning 
* lesson 3: Summary Functions and Maps: https://www.kaggle.com/residentmario/summary-functions-and-maps
* lesson 4: Grouping and Sorting: https://www.kaggle.com/residentmario/grouping-and-sorting
* lesson 5: Data Types and Missing Values: https://www.kaggle.com/residentmario/data-types-and-missing-values
* lesson 6: Renaming and Combining: https://www.kaggle.com/residentmario/renaming-and-combining 

In [1]:
import pandas as pd

## Basic Skills
Before we start, let's go over some important pandas skills we'll be using:

- Sorting
- Dropping missing values
- Checking for duplicates
- Joining
- Group by


In [None]:
# Define a sample dataframe:

list_of_lists = [
    ['Tom', 10, 'Tel Aviv'], 
    ['Jerry', 15, 'Tel Aviv'], 
    ['Ben', 21, 'Nahariyya'],
    ['Jerry', 22, 'Nahariyya'],
    ['Michal', 25, 'Eilat'],
    ['Maya'],
    ['Maya', 3],
]
  
# Create the pandas DataFrame
df = pd.DataFrame(list_of_lists, columns = ['Name', 'Age', 'City'])

# Look at a sample of the data
df.head()

### **Sort**

Documentation: 
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html

In [None]:
# Sorting by a numeric column
df = df.sort_values(by=['Age'])
print("Sorting by age:")
print(df, "\n")

# Sorting by a numeric column: high to low
df = df.sort_values(by=['Age'], ascending=False)
print("Sorting by age: high to low")
print(df, "\n")

# Sorting by a textual column
df = df.sort_values(by=['Name'])
print("Sorting by name:")
print(df, "\n")

# Sort by index column
df = df.sort_index()
print("Sorting by index:")
print(df, "\n")
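Two options that come up often when sorting are sorting by several columns at once and controlling where missing values land. The sketch below uses a small made-up frame (not the one above) and the `na_position` parameter of `sort_values`; by default, rows with a missing value in the sort column are placed last.

```python
import pandas as pd

# Hypothetical sample data, similar in shape to the frame above
people = pd.DataFrame(
    [['Tom', 10, 'Tel Aviv'], ['Jerry', 15, 'Tel Aviv'], ['Jerry', 22, 'Nahariyya'], ['Maya', None, None]],
    columns=['Name', 'Age', 'City'],
)

# Sort by name, and within the same name by age from high to low
by_name_then_age = people.sort_values(by=['Name', 'Age'], ascending=[True, False])
print(by_name_then_age, "\n")

# Put rows with a missing age first instead of last
missing_first = people.sort_values(by='Age', na_position='first')
print(missing_first, "\n")
```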

### **Drop missing values**
* Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [None]:
# Show nulls
df_v1 = df.isna()
print("Is this value missing from the df?")
print(df_v1, "\n")

# Drop all missing values
# Any row with at least one missing value (here: a missing age or city) is dropped
df_v2 = df.dropna()
print("Drop all missing values")
print(df_v2, "\n")

# Drop missing from specific column
df_v3 = df.dropna(subset=['Age'])
print("Drop missing ages only")
print(df_v3, "\n")
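`dropna` has a few more switches worth knowing. As a sketch on a made-up frame: `how='all'` drops a row only when every value in it is missing, and `thresh` keeps rows that have at least that many non-missing values.

```python
import pandas as pd

# Hypothetical data: one complete row, one partial row, one fully empty row
sample = pd.DataFrame(
    {'Name': ['Tom', 'Maya', None], 'Age': [10, None, None], 'City': ['Tel Aviv', None, None]}
)

# Drop a row only if ALL of its values are missing
no_empty_rows = sample.dropna(how='all')

# Keep rows that have at least 2 non-missing values
at_least_two = sample.dropna(thresh=2)

print(len(sample), len(no_empty_rows), len(at_least_two))
```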


### **Check for duplicate values**
* Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

In [None]:
# Do we have duplicated rows? (single answer)
print("Do we have duplicate rows? (returning a single answer)")
boolean_answer = df.duplicated().any()
print(boolean_answer, "\n")

# Do we have duplicated rows? (answer per row)
print("Do we have duplicate rows? (Return a series specifying if a row is a duplicate)")
df_v1 = df.duplicated()
print(df_v1, "\n")

# Do we have duplicated values in a specific column? (column: Name)
print("Do we have duplicated values in the 'Name' column?")
boolean_answer = df.Name.duplicated().any()
print(boolean_answer, "\n")

# What duplicate values does the column "Name" have?
print("What are the duplicated values?")
print(df.Name[df.Name.duplicated()], "\n")
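By default `duplicated()` marks only the second and later occurrences of a value. To see every row that shares a value, the first occurrence included, `keep=False` can be passed. A small sketch with made-up data:

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Ben', 'Jerry'], 'Age': [10, 15, 21, 22]})

# Default: only later occurrences are flagged
later_only = names[names['Name'].duplicated()]

# keep=False: flag every row whose name appears more than once
all_copies = names[names['Name'].duplicated(keep=False)]

print(later_only, "\n")
print(all_copies, "\n")
```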


### **Join**
* Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

In [None]:
# Let's define an additional dataframe

cities = [
    ['Nahariyya', 'North', 58000], 
    ['Tel Aviv-Yafo', 'Center', 435900], 
    ['Beer Sheva', 'South', 205000], 
    ['Eilat', 'South', 52000],
]
  
# Create the pandas DataFrame
df2 = pd.DataFrame(cities, columns = ['City', 'Area', 'Population'])

# Look at a sample of the data
df2.head()

In [None]:
df.head()

In [None]:
# Join each person in df with info about the city where they live

# Inner Join 
print("Inner join: return only rows that exist in both datasets")
print(df.join(df2.set_index('City'), on='City',  how='inner'), "\n")

# Left Join 
print("Left join: return all rows that exist in df, and rows in df2 if they match")
print(df.join(df2.set_index('City'), on='City',  how='left'), "\n")

# Full Outer Join 
print("Outer join: rows that exist in either dataset")
print(df.join(df2.set_index('City'), on='City',  how='outer'), "\n")
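`pd.merge` performs the same kind of join without first moving the key into the index, and its `indicator=True` option adds a `_merge` column showing which side each row came from. The sketch below uses small stand-in frames; it also makes visible why 'Tel Aviv' and 'Tel Aviv-Yafo' do not match each other.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames above
people = pd.DataFrame({'Name': ['Tom', 'Ben', 'Michal'], 'City': ['Tel Aviv', 'Nahariyya', 'Eilat']})
cities = pd.DataFrame({'City': ['Nahariyya', 'Tel Aviv-Yafo', 'Eilat'], 'Area': ['North', 'Center', 'South']})

# Outer join on the shared 'City' column; '_merge' is 'both', 'left_only' or 'right_only'
merged = pd.merge(people, cities, on='City', how='outer', indicator=True)
print(merged, "\n")

# Rows that found no partner in the other frame
unmatched = merged[merged['_merge'] != 'both']
print(unmatched, "\n")
```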


### **Group by**
* Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html


In [None]:
# How many people live in a city with a population of > 55K?

# 1. Let's edit df2's city names to align with df
print("Changing city name to enable matching between datasets:")
# Use .loc to assign in place; chained indexing (df2.City.iloc[1] = ...) may not update df2
df2.loc[1, 'City'] = 'Tel Aviv'
print(df2, "\n")

# 2. Left join between the datasets
print("Left join df with df2")
df3 = df.join(df2.set_index('City'), on='City',  how='left') 
print(df3, "\n")

# 3. Keep rows with population > 55K in dataset
print("Keep values with population > 55K")
df3 = df3[df3['Population'] > 55000]
print(df3, "\n")


# 4. Group by city and count results
print("Group by 'City' and count results")
df3 = df3.groupby(['City']).size()
print(df3, "\n")
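`size()` is one of several aggregations available on a grouped frame. As a sketch on made-up data, `agg` with named aggregations computes several summaries per group in one call; the output column names (`num_people`, `mean_age`) are chosen here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Tom', 'Jerry', 'Ben', 'Jerry'],
    'Age': [10, 15, 21, 22],
    'City': ['Tel Aviv', 'Tel Aviv', 'Nahariyya', 'Nahariyya'],
})

# One output column per (input column, aggregation) pair
summary = people.groupby('City').agg(
    num_people=('Name', 'count'),
    mean_age=('Age', 'mean'),
)
print(summary)
```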


## Transforming The Movie Datasets

In the following lesson, we'll use pandas to transform the movies dataset we explored last lesson into a more usable version.

We'll focus on several skills:
* Cleaning our data
* Asking questions about the data
* Changing the data to support analysis
* Using the data to answer our questions

**Loading the data**

Last lesson, we loaded each CSV into a separate dataframe and examined it. This time, let's be more organized and create one dictionary to store all the different dataframes.

dataframes is a dictionary:
* dataframes['movies'] will contain movies.csv
* dataframes['ratings'] will contain ratings.csv 
and so on

In [2]:
# define a single dictionary to store all the datasets
dataframes = {}

# load each of the datasets into the dictionary
for csv_name in ['ratings', 'tags', 'movies', 'links']:
  url = 'https://raw.githubusercontent.com/Sagi15G/de_python_course/main/data/movies_csv/{}.csv'.format(csv_name)
  dataframes[csv_name] = pd.read_csv(url, sep=',')

# example for accessing a single df:
dataframes['movies'].head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Step 1: Cleaning up the dataset

We want to clean up the 'movies' data so it will be easier to use later on in our analysis. In order to do this, let's look at the data and imagine edge cases it may have. Let's frame these edge cases as questions and answer them. 

Some examples:
* Do we have missing titles? genres?
* Do we have upper/lower case issues with titles?
* Can we have multiple rows for a single movie title?
* Is each movie id unique?
* Do all movies have year information?
* Do all years make sense?
* Do we have the same movie names for different years?

Do you have additional questions you want to ask?

#### **Do we have missing titles/genres?**

In [None]:
# do we have missing values
print("Are 'title' values missing?")
print(dataframes['movies'].title.isna().any(), "\n")

print("Are 'genres' values missing?")
print(dataframes['movies'].genres.isna().any(), "\n")

# Can genre values have blank spaces?
print("Are there 'genre' values with blank spaces?")
print(dataframes['movies'][dataframes['movies']['genres'].str.contains(" ")].head(), "\n")

In [None]:
# Missing genres appear as "(no genres listed)"; let's relabel them as 'unknown' using a lambda function
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

dataframes['movies']['genres'] = dataframes['movies']['genres'].apply(lambda x: x if x != "(no genres listed)" else "unknown")

dataframes['movies'][dataframes['movies']['genres'].str.contains("unknown")].head()

#### **Do we have upper/lower case issues with titles?**

In [None]:
# Our titles are case sensitive - let's change that!
dataframes['movies']['title'] = dataframes['movies']['title'].str.lower()

dataframes['movies'].head()

#### **Do we have duplicate rows for a movie title?**

In [None]:
# Do we have duplicate rows for a movie title?
print("Is there a movie title with more than one row?")
dataframes['movies'].title.duplicated().any()


In [None]:
# Which rows have duplicated titles?
dataframes['movies'].title.value_counts()

In [None]:
# Let's return dataframe rows where we have duplicated titles:

titles = dataframes['movies']['title']
duplicates = dataframes['movies'][titles.isin(titles[titles.duplicated()])]
duplicates.sort_values(by=["title"])

# The difference seems to be the number of genres!

In [None]:
# Keep one row per duplicated title
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

# Keep only one row for each duplicated value using drop_duplicates
dataframes['movies'] = dataframes['movies'].drop_duplicates(subset=['title'], keep='last')

print("Is there a movie title with more than one row now?")
dataframes['movies'].title.duplicated().any()

# Ideally, we'd do something smarter, like keep the rows with more genre information. You can try it yourself...
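One possible way to approach that exercise, as a sketch rather than the only answer: count the genres in each row, sort so that the row with the most genres comes last, and then let `drop_duplicates(keep='last')` pick it. The toy frame and the helper column `num_genres` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical rows: the same title appears twice with different genre lists
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['emma (1996)', 'emma (1996)', 'heat (1995)'],
    'genres': ['Comedy|Drama|Romance', 'Romance', 'Action|Crime'],
})

# Genres are separated by '|', so the number of genres is the number of separators plus one
movies['num_genres'] = movies['genres'].str.count(r'\|') + 1

deduped = (
    movies.sort_values(by='num_genres')                     # fewest genres first
          .drop_duplicates(subset=['title'], keep='last')   # keep the richest row per title
          .drop(columns='num_genres')                       # remove the helper column
          .sort_index()                                     # restore the original row order
)
print(deduped)
```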

#### **Is each movie id unique?**

In [None]:
# Is each movie id unique?
# If the highest count for any single movieId is 1, every id appears exactly once
dataframes['movies']['movieId'].value_counts().max()

In [None]:
dataframes['movies'].head()

#### **Do all movies have year information?**

In [None]:
# we'll add the 'has_year' column with True/False values based on the title string
dataframes['movies']['has_year'] = dataframes['movies']['title'].apply(lambda x: x.split()[-1].strip("()").isnumeric())

# movies that don't have year information
dataframes['movies'][~dataframes['movies']['has_year']]

#### **Do all years make sense?**

In [None]:
# Let's add a year column to the dataset 

# Create a year column assuming title format is: <some string> (year)
# If the year string is available, use it, otherwise set year = -1
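# Row values are accessed by label; positional x[0]/x[1] on a labeled row is deprecated in recent pandas versions
dataframes['movies']['year'] = dataframes['movies'][['title','has_year']].apply(lambda x: int(x['title'].split()[-1].strip("()")) if x['has_year'] else -1,  axis=1)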

# Sample results:
dataframes['movies'][dataframes['movies']['has_year']].head()


In [None]:
# what's the minimal year?
has_year = dataframes['movies'][dataframes['movies']['has_year']]
has_year.sort_values(by=['year']).head()

In [None]:
# update the year calculation to only use 4-digit years
def title_contains_year(x):
  year_string = x.split()[-1].strip("()")
  return year_string.isnumeric() and len(year_string) == 4

# recalculate has_year and year
dataframes['movies']['has_year'] = dataframes['movies']['title'].apply(title_contains_year)
dataframes['movies']['year'] = dataframes['movies'][['title','has_year']].apply(lambda x: int(x['title'].split()[-1].strip("()")) if x['has_year'] else -1,  axis=1)

In [None]:
# what's the minimal year now?
has_year = dataframes['movies'][dataframes['movies']['has_year']]
has_year.sort_values(by=['year']).head()
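The same check can also be written with a regular expression. The sketch below assumes the year is a four-digit number in parentheses at the very end of the title, and uses `str.extract` to pull it out in one step; titles that do not match get -1, as above. The sample titles are made up.

```python
import pandas as pd

titles = pd.Series(['toy story (1995)', 'babylon 5', 'some movie (06)'])

# Capture four digits inside parentheses at the end of the string
year_text = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

# Non-matching titles become NaN; replace them with -1 and convert to integers
years = pd.to_numeric(year_text, errors='coerce').fillna(-1).astype(int)
print(years.tolist())
```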

### Step 2: Asking questions about the data

* Do we have movies with foreign titles?
* Can one movie appear with multiple years in our data?
* What is the average rating per movie title?
* Do recent movies have more ratings on avg than older movies?
* Does positive/negative sentiment towards a movie in the 'tags' dataset translate to high/low average rating?

### Steps 3+4: Changing the data to support analysis & Using the data to answer our questions

#### **Do we have movies with foreign titles?**

In [None]:
# Do we have movies in other languages?
# Documentation: https://docs.python.org/3/library/stdtypes.html

# Define a function checking whether all characters in a title are ASCII
# (a rough proxy for "English": it flags accented or non-Latin characters)
def isEnglish(s):
    return s.isascii()

# Define is_english column by applying the function to each title
dataframes['movies']['is_english'] = dataframes['movies']['title'].apply(isEnglish)

# Display a sample of the results
print("Non-english titles:")
dataframes['movies'][~dataframes['movies']['is_english']].head()
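Keep in mind that `isascii()` only tests the characters, not the language, so it is a rough proxy at best. A quick illustration with made-up sample titles:

```python
titles = ['amélie (2001)', 'das boot (1981)', 'toy story (1995)']

for title in titles:
    # True when every character is in the ASCII range
    print(title, '->', title.isascii())

ascii_flags = [title.isascii() for title in titles]
```

The German title passes the check because it happens to use only ASCII letters, while any title containing an accented letter fails it regardless of language.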

#### **Can one movie appear with multiple years in our data?**

In [None]:
# Create a new column containing the title name with no year
dataframes['movies']['movie_name'] = dataframes['movies'][['title','has_year']].apply(lambda x: " ".join(x['title'].split()[:-1]) if x['has_year'] else x['title'],  axis=1)

# Sample results:
dataframes['movies'][dataframes['movies']['has_year']].head()

In [None]:
# do we have the same movie names for different years?

movie_names = dataframes['movies'].groupby(['movie_name']).size() 
movie_names.loc[movie_names.values>1].sort_values(ascending=False)

In [None]:
# let's look at an example from the dataset:
dataframes['movies'][dataframes['movies']['movie_name'] == 'hamlet']

#### **What is the average rating per movie title?**

In [None]:
dataframes['ratings'].head()

In [None]:
# We'll finally use the ratings dataset! 
# Let's validate all rows have ratings
print("Are there rows with missing ratings?")
print(dataframes['ratings']['rating'].isna().any())

print("What is the min rating in the dataset?")
print(dataframes['ratings']['rating'].min())

print("What is the max rating in the dataset?")
print(dataframes['ratings']['rating'].max())


In [None]:
# get avg rating per movie
print("Get mean rating per movie")
mean_rating = dataframes['ratings'][['movieId', 'rating']].groupby(['movieId']).mean()
print(mean_rating.head(), "\n")

# round avg rating to nearest 0.1 value
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html
print("Rounding to nearest 0.1 value")
mean_rating = mean_rating.round(decimals=1)
print(mean_rating.head(), "\n")

In [None]:
# Match each movie with the avg rating and store in a new dataset

# Left Join (add 'rating' to 'movies' df)
dataframes['combined'] = dataframes['movies'].join(mean_rating, on='movieId',  how='left')

# Rename the 'rating' column to 'mean_rating'
dataframes['combined'] = dataframes['combined'].rename(columns={'rating':'mean_rating'})

# Display a sample of the data
dataframes['combined'].head()

#### **Do recent movies have more ratings on avg than older movies?**

In [None]:
# Get number of ratings per movie
print("Get number of ratings per movie")
num_ratings = dataframes['ratings'][['movieId', 'rating']].groupby(['movieId']).size()
num_ratings = num_ratings.to_frame('num_ratings')
num_ratings.head()

In [None]:
# Match each movie with the number of ratings it had

# Left Join to add number of ratings to 'movies' df
dataframes['combined'] = dataframes['combined'].join(num_ratings, on='movieId',  how='left')

# Display a sample of the data
dataframes['combined'].head()
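The mean rating and the number of ratings can also be computed in a single pass with named aggregations. A sketch on a tiny stand-in for the ratings data (the values are made up, and `per_movie` is a name introduced here):

```python
import pandas as pd

ratings = pd.DataFrame({
    'userId': [1, 2, 3, 1],
    'movieId': [10, 10, 10, 20],
    'rating': [4.0, 5.0, 3.0, 2.5],
})

# 'count' counts non-missing ratings per movie; 'mean' averages them
per_movie = ratings.groupby('movieId')['rating'].agg(mean_rating='mean', num_ratings='count')
per_movie['mean_rating'] = per_movie['mean_rating'].round(1)
print(per_movie)
```

The result is indexed by `movieId`, so it can be joined onto the movies data with a single `join(..., on='movieId')`.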

In [None]:
# define movie year buckets

def year_bucket(year):
  if year == -1:
    return 'unknown_year'
  elif year < 1960:
    return '1960-'
  elif year >= 1960 and year < 1970:
    return '1960-1970'
  elif year >= 1970 and year < 1980:
    return '1970-1980'
  elif year >= 1980 and year < 1990:
    return '1980-1990'
  elif year >= 1990 and year < 2000:
    return '1990-2000'
  elif year >= 2000 and year < 2010:
    return '2000-2010'
  else:
    return '2010+'

dataframes['combined']['year_bucket'] = dataframes['combined']['year'].apply(year_bucket)
dataframes['combined'].head()

In [None]:
dataframes['combined'][['year_bucket', 'num_ratings']].groupby('year_bucket').mean()
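For reference, pandas has a built-in for this kind of bucketing, `pd.cut`. The sketch below reproduces the same bucket labels on a made-up series of years; `right=False` makes each bin include its lower edge and exclude its upper edge, and the -1 placeholder is handled separately because it would otherwise fall into the first bin.

```python
import pandas as pd

years = pd.Series([-1, 1959, 1960, 1999, 2010])

# Bin edges: [-inf, 1960), [1960, 1970), ..., [2010, inf)
bins = [float('-inf'), 1960, 1970, 1980, 1990, 2000, 2010, float('inf')]
labels = ['1960-', '1960-1970', '1970-1980', '1980-1990', '1990-2000', '2000-2010', '2010+']

buckets = pd.cut(years, bins=bins, labels=labels, right=False)

# Convert from categorical to plain strings, then mark the placeholder year
buckets = buckets.astype(object).where(years != -1, 'unknown_year')
print(buckets.tolist())
```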

#### **Does positive/negative sentiment towards a movie in the 'tags' dataset translate to high/low average rating?**

In [None]:
# Let's take a look at the tags data
dataframes['tags'].head()

In [None]:
# Our simplistic hypothesis here would be that we can define a set of positive and negative words. 
# If a "tag" contains a negative word, it conveys negative sentiment towards the movie
# If a "tag" contains a positive word, it conveys positive sentiment towards the movie

# Define a list of words:
negative_words = ["bad", "not", "terrible", "horrible"]
positive_words = ["great", "good", "loved", "like", "fun", "funny"]

In [None]:
# Define functions to determine whether a tag contains a positive or negative word
def contains_negative_words(s):
  for word in s.split():
    if word in negative_words:
      return True
  return False 

def contains_positive_words(s):
  for word in s.split():
    if word in positive_words:
      return True
  return False 

# Creating new columns based on the functions
dataframes['tags']['negative_sentiment'] = dataframes['tags']['tag'].apply(contains_negative_words)
dataframes['tags']['positive_sentiment'] = dataframes['tags']['tag'].apply(contains_positive_words)
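One limitation of the functions above is that the match is case sensitive, so a tag such as "Very Funny" would be missed. A sketch of a single helper that lowercases the tag first; `contains_any` is a name introduced here and is not part of the notebook.

```python
negative_words = {"bad", "not", "terrible", "horrible"}
positive_words = {"great", "good", "loved", "like", "fun", "funny"}

def contains_any(tag, words):
    # Lowercase the tag, split it on whitespace, and look for any word from the set
    return any(word in words for word in tag.lower().split())

print(contains_any("Very Funny", positive_words))
print(contains_any("not funny", negative_words))
print(contains_any("not funny", positive_words))
```

Note that "not funny" matches both word sets. Word matching alone cannot resolve negation, so tags like this need a second look.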


In [None]:
# Sample positive sentiment tags
print("Positive sentiment")
print(dataframes['tags'][['tag', 'positive_sentiment']][dataframes['tags']['positive_sentiment']].head(), "\n")

# Sample negative sentiment tags
print("Negative sentiment")
print(dataframes['tags'][['tag', 'negative_sentiment']][dataframes['tags']['negative_sentiment']].head(), "\n")

# Rows with positive AND negative sentiment
print("Positive AND negative sentiment")
print(dataframes['tags']['tag'][(dataframes['tags']['positive_sentiment']) & (dataframes['tags']['negative_sentiment'])], "\n")

In [None]:
# let's mark confused rows appropriately
# villain nonexistent or not needed for good story --> positive sentiment
dataframes['tags'].iloc[2496, 4] = False
dataframes['tags'].iloc[2496, :]

In [None]:
# not funny --> negative sentiment
dataframes['tags'].iloc[2563, 5] = False
dataframes['tags'].iloc[2563, :]

In [None]:
# How many tags with positive/negative sentiment does the movie have?

# Defining a dataset with positive tags only
print("Movies with positive tags:")
positive_tags =  dataframes['tags'][dataframes['tags']['positive_sentiment']]
# Grouping by movie and counting
num_positive_tags = positive_tags.groupby(['movieId']).size()
num_positive_tags = num_positive_tags.to_frame('num_positive_tags')
print(num_positive_tags.head(), "\n")


# Defining a dataset with negative tags only
print("Movies with negative tags:")
negative_tags =  dataframes['tags'][dataframes['tags']['negative_sentiment']]
# Grouping by movie and counting
num_negative_tags = negative_tags.groupby(['movieId']).size()
num_negative_tags = num_negative_tags.to_frame('num_negative_tags')
print(num_negative_tags.head(), "\n")


In [None]:
# Left Join to add positive/negative sentiment to combined data
dataframes['combined'] = dataframes['combined'].join(num_positive_tags, on='movieId',  how='left')
dataframes['combined'] = dataframes['combined'].join(num_negative_tags, on='movieId',  how='left')

In [None]:
# Look at sample data
cols = ['title', 'year', 'mean_rating', 'num_positive_tags', 'num_negative_tags']
dataframes['combined'][cols][(dataframes['combined']['num_positive_tags'] > 0) | (dataframes['combined']['num_negative_tags'] > 0) ].head()

In [None]:
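# Fill NaN with 0 in the count columns only
# (filling every column would also turn missing mean ratings into 0 and skew the rating statistics)
count_cols = ['num_ratings', 'num_positive_tags', 'num_negative_tags']
dataframes['combined'][count_cols] = dataframes['combined'][count_cols].fillna(0)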

# Look at sample data again
dataframes['combined'][cols][(dataframes['combined']['num_positive_tags'] > 0) | (dataframes['combined']['num_negative_tags'] > 0) ].head()

In [None]:
# If a movie had positive sentiment, did it rate high?

# take movies with positive tags and no negative tags
positives = dataframes['combined'][(dataframes['combined']['num_positive_tags'] > 0) & (dataframes['combined']['num_negative_tags'] == 0)]
positives['mean_rating'].describe()

In [None]:
# if a movie had negative sentiment, did it rate low?

# take movies with negative tags and no positive tags
negatives = dataframes['combined'][(dataframes['combined']['num_negative_tags'] > 0) & (dataframes['combined']['num_positive_tags'] == 0)]
negatives['mean_rating'].describe()

# Visualize

## Basic Skills

Before we start, let's go over some important visualization skills.
We'll use pandas' built-in plotting capabilities, but many other libraries are also available for creating graphs!

* line chart
* bar chart
* pie chart
* scatter plot

In [None]:
# Define a sample dataframe:

list_of_lists = [
    ['2021-04-01', 60304, 42005, 26067, 70809], 
    ['2021-04-02', 61758, 40294, 25003, 70300], 
    ['2021-04-03', 65336, 49384, 23928, 71500], 
    ['2021-04-04', 59003, 46493, 23049, 69002], 
    ['2021-04-05', 62507, 42287, 24578, 72493], 
    ['2021-04-06', 61984, 45539, 26002, 71394], 
    ['2021-04-07', 62679, 41393, 25993, 70823], 
]
  
# Create the pandas DataFrame
df = pd.DataFrame(list_of_lists, columns = ['ds', 'Instagram', 'Facebook', 'Messenger', 'Whatsapp'])

# Look at a sample of the data
print("Data describing total users by app in Raanana (not really!)")
df.head()

In [None]:
df = df.set_index('ds')
df.head()

### Line Chart
* Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.line.html

In [None]:
# Use pandas to plot a line chart
ax = df.plot.line()

In [None]:
# Let's rotate the x axis labels, they're on top of each other!
ax = df.plot.line(rot=60)


In [None]:
# Plot a line with a specific title and figure size
ax = df.plot.line(title="Total users by app in Raanana (April 2021)", figsize=(10,5), rot=90)

# Add x, y labels to our graph
ax.set_xlabel("date")
ax.set_ylabel("num_users")


### Bar Chart
* Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html

In [None]:
# Use pandas to plot a bar chart
ax = df.plot.bar()

In [None]:
# Add title and figure size
ax = df.plot.bar(title="Total users by app in Raanana (April 2021)", figsize=(10,5))

# add x, y labels
ax.set_xlabel("date")
ax.set_ylabel("num_users")

In [None]:
# plot the bars stacked
ax = df.plot.bar(title="Total users by app in Raanana (April 2021)", figsize=(10,5), stacked=True)

# add x, y labels
ax.set_xlabel("date")
ax.set_ylabel("num_users")

### Pie Charts
* Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html

In [None]:
# Create the pandas DataFrame
df = pd.DataFrame(
      {'number_of_users': [60304, 42005, 26067, 70809]}, 
      index=['Instagram', 'Facebook', 'Messenger', 'Whatsapp']
    )


In [None]:
# Plot a pie chart (built into pandas)
plot = df.plot.pie(y='number_of_users', figsize=(7, 7 ))


### Scatter Plot
* Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html

In [None]:
# Example from documentation!
# Define a dataframe
df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
                   [6.4, 3.2, 1], [5.9, 3.0, 2]],
                  columns=['length', 'width', 'species'])


In [None]:
# Scatter plot for a graph with x, y axes
ax = df.plot.scatter(x='length', y='width')

In [None]:
# Categorize the dataset by adding a color dimension
ax = df.plot.scatter(x='length', y='width', c='species', colormap='viridis')

## Visualizing Our Dataset
Last lesson we transformed the movies dataset. In this lesson we'll ask questions about the data and answer them visually.

### **Asking Questions**

We can often ask specific questions about a dataset and then generalize them into questions to be answered with graphs.

* How many movies from 2007 does this dataset have? --> **How many movies are available in the dataset per year?**
* How many movie ratings do we have for old movies?/new movies? --> **How many ratings do we have per year?**
* What is the mean movie rating for old movies?/new movies? --> **What is the mean movie rating per year?**
* Do newer movies have more tags than older movies? --> **How many tags do we have per movie year?**
* Does our dataset have more comedies than action films? --> **How many movies do we have per genre?**
* Are comedies rated higher than action films? --> **What is the mean/median rating per genre?**

### **How many movies are available in the dataset per year?**


In [None]:
dataframes['combined'].head()

In [None]:
# Group a subset of the data by year and count
df = dataframes['combined'][['year']].groupby(['year']).size()
df

In [None]:
# Use pandas to plot a line chart with a title and specific size
ax = df.plot.line(title="Number of Movies per Year", figsize=(10,5))

# Add x, y labels
ax.set_xlabel('year')
ax.set_ylabel('number of movies')

In [None]:
# Oh no! The scale of our graph looks bad because of the -1 values
# Let's remove them and try again
# Drop the placeholder year by its label, so nothing else is removed if -1 is absent
df = df.drop(index=-1, errors='ignore')
df.head()

In [None]:
# Use pandas to plot a line chart with a title and specific size
ax = df.plot.line(title="Number of Movies per Year", figsize=(10,5))

# Add x, y labels
ax.set_xlabel('year')
ax.set_ylabel('number of movies')

In [None]:
# Another option: plot in a pie chart
ax = df.plot.pie(y='number of movies by year', figsize=(10,5))

In [None]:
# Wow that's a bit too much on the eyes!!
# Let's use the year groups we defined last class to make this less intense

df = dataframes['combined'].groupby(['year_bucket']).size()

ax = df.plot.pie(title="number of movies per year_bucket", figsize=(10,5))

In [None]:
# That 'None' on the side is happening because we have a series, not a dataframe, and didn't specify a y column

print("How does this dataset look?")
print(df.head(), "\n")

print("Is it a dataframe?")
print(isinstance(df, pd.DataFrame))

print("Is it a series?")
print(isinstance(df, pd.Series))

In [None]:
# Let's remove it!

ax = df.plot.pie(title="number of movies per year_bucket", figsize=(10,5), label='')

In [None]:
# Let's edit this pie a bit more...

# Add a legend instead of labels
# Pretty colors

colors_list = ['violet', 'indigo', 'blue', 'green', 'yellow', 'orange', 'red', 'pink'] #rainbow!
ax = df.plot.pie(title="number of movies per year_bucket", figsize=(10,5), labeldistance=None, legend=True, label='', colors=colors_list)

### **How many ratings do we have per movie year?**

In [None]:
# summarize the total ratings per year
df = dataframes['combined'][['year', 'num_ratings']].groupby(['year']).sum()

# remove year = -1 (dropped by label, so nothing else is removed if -1 is absent)
df = df.drop(index=-1, errors='ignore')

# display a sample of the data
df.head()

In [None]:
# use pandas to plot a line chart with a title and specific size
ax = df.plot.line(title="Number of Ratings per Year", figsize=(10,5))

# add x, y labels
ax.set_xlabel('year')
ax.set_ylabel('number of ratings')

### **What is the mean movie rating per year?**

In [None]:
# Our rating is actually the mean rating per movie, so we'll answer with mean of means...

# Mean ratings per year
df = dataframes['combined'][['year', 'mean_rating']].groupby(['year']).mean()

# Remove year = -1 (dropped by label, so nothing else is removed if -1 is absent)
df = df.drop(index=-1, errors='ignore')

# Display a sample of the data
df.head()
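A mean of means weights every movie equally, however many ratings it received, so it can differ noticeably from the mean over all individual ratings. A small worked example with made-up numbers:

```python
from statistics import mean

# Hypothetical ratings: one movie with four ratings, one with a single rating
ratings_by_movie = {'movie_a': [5, 5, 5, 5], 'movie_b': [1]}

# Mean of the per-movie means: every movie counts equally
mean_of_means = mean(mean(ratings) for ratings in ratings_by_movie.values())

# Mean over all individual ratings: every rating counts equally
all_ratings = [r for ratings in ratings_by_movie.values() for r in ratings]
overall_mean = mean(all_ratings)

print(mean_of_means, overall_mean)
```

Here the mean of means is (5 + 1) / 2 = 3, while the mean over all five ratings is 21 / 5 = 4.2.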

In [None]:
# Use pandas to plot a line chart with a title and specific size
ax = df.plot.line(title="Mean rating per Year", figsize=(10,5))

# Add x, y labels
ax.set_xlabel('year')
ax.set_ylabel('mean rating')

### **How many movies do we have per genre?**

In [None]:
# Simple approach: bar chart broken down by genre
print("Grouping dataframe by genre and counting the results:")
df = dataframes['combined'].groupby('genres').size()
print(df.head(), "\n")

# Plotting into a bar chart
print("Creating a bar chart based on results:")
ax = df.plot.bar(title="Number of movies by genre", figsize=(10,5), rot=60)

In [None]:
# So this would work well if each movie had only one genre! 
# But each movie can have several genres! We don't want to break down results by genre groups...

In [None]:
# We need to transform our dataset to support this.
# Methodology: 
# 1. Let's define a list of all genres
# 2. Add a column per genre in order to categorize each movie as belonging to the genre or not

# Define a list of genres:
genres = [
  'Action',
  'Adventure',
  'Animation',
  'Children',
  'Comedy',
  'Crime',
  'Documentary',
  'Drama',
  'Fantasy',
  'Film-Noir',
  'Horror',
  'Musical',
  'Mystery',
  'Romance',
  'Sci-Fi',
  'Thriller',
  'War',
  'Western'
]

for genre in genres:
  dataframes['combined']['is_{}'.format(genre.lower())] = dataframes['combined']['genres'].apply(lambda x: genre in x)
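pandas also offers a built-in for this kind of split: `Series.str.get_dummies` creates one indicator column per value found between separators. A sketch on a toy frame; note that it discovers the genres from the data rather than from a predefined list, and that the column names are the raw genre names (no `is_` prefix).

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['toy story (1995)', 'heat (1995)'],
    'genres': ['Adventure|Comedy', 'Action|Crime'],
})

# One 0/1 column per genre, converted to True/False
genre_flags = movies['genres'].str.get_dummies(sep='|').astype(bool)
print(genre_flags)
```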


In [None]:
# Let's see a sample of our df now
dataframes['combined'].head()

In [None]:
# So now we can easily plot per category!

# Let's look at 'comedy' movies in a bar chart
ax = dataframes['combined']['is_comedy'].value_counts().plot.bar(title="is_comedy?")


In [None]:
# Let's look at 'comedy' movies in a pie chart
ax1 = dataframes['combined']['is_comedy'].value_counts().plot.pie(title="is_comedy?")

In [None]:
# Let's combine all the true value counts per genre into one new dataframe, and graph with it

In [None]:
# What is dataframes['combined']['is_comedy'].value_counts()?

a = dataframes['combined']['is_comedy'].value_counts()
print("How does this dataset look?")
print(a.head(), "\n")

print("Is it a dataframe?")
print(isinstance(a, pd.DataFrame))

print("Is it a series?")
print(isinstance(a, pd.Series))


In [None]:
# So let's give the index and values a name and turn it into a dataframe
b = pd.DataFrame({'boolean':a.index, 'counts':a.values})

# Let's look at sample data now:
b.head()

In [None]:
# looping over the genres list
for x in range(0, len(genres)):
  # defining a genre based on the list index
  genre = genres[x].lower()
  # per genre, using the is_<genre> column to count the total movies in it
  genre_df_prep = dataframes['combined']['is_{}'.format(genre)].value_counts()
  # converting the results to a dataframe with "boolean" and "counts" columns
  genre_df_prep = pd.DataFrame({'boolean':genre_df_prep.index, 'counts':genre_df_prep.values})
  # specifying which genre we handled
  genre_df_prep['genre'] = genre
  # if this is the first genre, create a new df, genre_df
  if x == 0:
    # genre df will store True values
    genre_df = genre_df_prep[genre_df_prep['boolean']]
  else:
    temp_df = genre_df_prep[genre_df_prep['boolean']]
    genre_df = pd.concat([genre_df, temp_df])


genre_df.head()
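Because `True` counts as 1 when summed, the same per-genre totals can also be obtained by summing the boolean columns directly. A sketch using a made-up frame with two `is_<genre>` columns; the column list is built from the genre names so that other `is_...` columns (such as `is_english`) are not picked up by accident.

```python
import pandas as pd

combined = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'is_comedy': [True, False, True],
    'is_drama': [False, False, True],
})
genres = ['Comedy', 'Drama']

# Build the flag column names from the genre list, then sum each column
flag_columns = ['is_{}'.format(genre.lower()) for genre in genres]
counts = combined[flag_columns].sum().sort_values()
print(counts)
```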

In [None]:
# Plot all genres in a bar plot
ax = genre_df[['genre', 'counts']].set_index('genre').plot.bar(title="Total movies per genre", figsize=(10,5))


In [None]:
# let's get this sorted
sorted_df = genre_df[['genre', 'counts']].set_index('genre').sort_values(by=['counts'])
ax = sorted_df.plot.bar(title="Total movies per genre", figsize=(10,5))

In [None]:
# removing the legend
ax = sorted_df.plot.bar(title="Total movies per genre", figsize=(10,5), legend=False)

# add x, y labels (set after the plot is created, so they apply to this chart)
ax.set_xlabel("genre")
ax.set_ylabel("number of movies")

### **What is the mean rating per genre?**

In [None]:
# We basically need to do a very similar thing.
# If we have to repeat code to do similar things, let's write a function to repeat things for us!

def plot_data_by_genre(title, column_name_in_combined_df, column_name_to_graph, action):
  # looping over the genres list and collecting one row per genre
  rows = []
  for genre in genres:
    # preparing the df: keep only the movies that belong to this genre
    genre = genre.lower()
    genre_df_prep = dataframes['combined'][dataframes['combined']['is_{}'.format(genre)]]
    # calling the aggregation named by `action` (e.g. "mean", "count") on the chosen column
    value = getattr(genre_df_prep[column_name_in_combined_df], action)()
    rows.append({'genre': genre, column_name_to_graph: value})
  genre_df = pd.DataFrame(rows)

  # plotting the data
  sorted_df = genre_df.set_index('genre').sort_values(by=[column_name_to_graph])
  ax = sorted_df.plot.bar(title=title, figsize=(10,5))



In [None]:
# Let's run our function

plot_data_by_genre("Mean rating per genre", "mean_rating", "mean_mean_rating", "mean")

In [None]:
# I'm not loving the scale of the y axis... 
# Most results are between 2.5-4 but we look at 0-5 values, so differences appear minor
# Let's edit our function to support y limits 

def plot_data_by_genre(title, column_name_in_combined_df, column_name_to_graph, action, ylimit=None):
  # looping over the genres list and collecting one row per genre
  rows = []
  for genre in genres:
    # preparing the df: keep only the movies that belong to this genre
    genre = genre.lower()
    genre_df_prep = dataframes['combined'][dataframes['combined']['is_{}'.format(genre)]]
    # calling the aggregation named by `action` (e.g. "mean", "count") on the chosen column
    value = getattr(genre_df_prep[column_name_in_combined_df], action)()
    rows.append({'genre': genre, column_name_to_graph: value})
  genre_df = pd.DataFrame(rows)

  # plotting the data, applying the y limits only if they were given
  sorted_df = genre_df.set_index('genre').sort_values(by=[column_name_to_graph])
  if ylimit:
    ax = sorted_df.plot.bar(title=title, figsize=(10,5), ylim=ylimit)
  else:
    ax = sorted_df.plot.bar(title=title, figsize=(10,5))



In [None]:
plot_data_by_genre("Mean rating per genre", "mean_rating", "mean_mean_rating", "mean", ylimit=[2.5, 4])

In [None]:
# let's run our function with some other parameters!

In [None]:
# Median mean rating
plot_data_by_genre("Median mean rating per genre", "mean_rating", "median_mean_rating", "median")

In [None]:
# Min mean rating
plot_data_by_genre("Min mean rating per genre", "mean_rating", "min_mean_rating", "min")

In [None]:
# Max mean rating
plot_data_by_genre("Max mean rating per genre", "mean_rating", "max_mean_rating", "max")

In [None]:
# And just to show our counts option works with this function too
plot_data_by_genre("Number of movies by genre", "title", "number of movies", "count")
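Separating the calculation from the plotting makes a function like this easier to check. As a sketch, the hypothetical helper `summarize_by_genre` below returns the per-genre values as a sorted Series, which could then be passed to `.plot.bar(...)`; the toy frame stands in for `dataframes['combined']`.

```python
import pandas as pd

def summarize_by_genre(df, genres, column, action):
    # For each genre, filter to its movies and apply the aggregation named by `action`
    values = {
        genre.lower(): getattr(df[df['is_{}'.format(genre.lower())]][column], action)()
        for genre in genres
    }
    return pd.Series(values, name='{}_{}'.format(action, column)).sort_values()

# Made-up data: three movies with two genre flags
combined = pd.DataFrame({
    'mean_rating': [4.0, 2.0, 3.0],
    'is_comedy': [True, True, False],
    'is_drama': [False, True, True],
})

result = summarize_by_genre(combined, ['Comedy', 'Drama'], 'mean_rating', 'mean')
print(result)
```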