<a href="https://colab.research.google.com/github/Sagi15G/de_python_course/blob/main/pandas_transform_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transform your data

This session assumes you have completed the pandas lessons on Kaggle (https://www.kaggle.com/learn/pandas):
* lesson 1: Creating, Reading and Writing: https://www.kaggle.com/residentmario/creating-reading-and-writing
* lesson 2: Indexing, Selecting & Assigning: https://www.kaggle.com/residentmario/indexing-selecting-assigning 
* lesson 3: Summary Functions and Maps: https://www.kaggle.com/residentmario/summary-functions-and-maps
* lesson 4: Grouping and Sorting: https://www.kaggle.com/residentmario/grouping-and-sorting
* lesson 5: Data Types and Missing Values: https://www.kaggle.com/residentmario/data-types-and-missing-values
* lesson 6: Renaming and Combining: https://www.kaggle.com/residentmario/renaming-and-combining 

In [None]:
import pandas as pd

## Basic Skills
Before we start, let's go over some important pandas skills we'll be using:

- Sorting
- Dropping missing values
- Checking for duplicates
- Joining
- Group by


In [None]:
# Define a sample dataframe:

list_of_lists = [
    ['Tom', 10, 'Tel Aviv'], 
    ['Jerry', 15, 'Tel Aviv'], 
    ['Ben', 21, 'Nahariyya'],
    ['Jerry', 22, 'Nahariyya'],
    ['Michal', 25, 'Eilat'],
    ['Maya'],
    ['Maya', 3],
]
  
# Create the pandas DataFrame
df = pd.DataFrame(list_of_lists, columns = ['Name', 'Age', 'City'])

# Look at a sample of the data
df.head()

### **Sorting**

Documentation: 
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html

In [None]:
# Sorting by a numeric column
df = df.sort_values(by=['Age'])
print("Sorting by age:")
print(df, "\n")

# Sorting by a numeric column: high to low
df = df.sort_values(by=['Age'], ascending=False)
print("Sorting by age: high to low")
print(df, "\n")

# Sorting by a textual column
df = df.sort_values(by=['Name'])
print("Sorting by name:")
print(df, "\n")

# Sort by index column
df = df.sort_index()
print("Sorting by index:")
print(df, "\n")
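
Sorting also works on several columns at once, and you can control where rows with a missing sort key end up. A minimal sketch on a small stand-in frame (`people` is a hypothetical name, not the `df` defined above):

```python
import pandas as pd

# Hypothetical sample data, similar in shape to the frame used above
people = pd.DataFrame(
    [['Tom', 10, 'Tel Aviv'], ['Jerry', 15, 'Tel Aviv'], ['Ben', 21, 'Nahariyya'],
     ['Jerry', 22, 'Nahariyya'], ['Maya', None, None]],
    columns=['Name', 'Age', 'City'],
)

# Sort by Name (A to Z), then by Age (high to low) within each name
sorted_people = people.sort_values(by=['Name', 'Age'], ascending=[True, False])
print(sorted_people, "\n")

# Rows with a missing sort key go last by default; na_position='first' moves them to the top
missing_first = people.sort_values(by='Age', na_position='first')
print(missing_first, "\n")
```

Passing a list to `ascending` sets the direction per column, in the same order as the `by` list.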

### **Drop missing values**
* Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [None]:
# Show nulls
df_v1 = df.isna()
print("Is this value missing from the df?")
print(df_v1, "\n")

# Drop all missing values
# Every row with a missing Age or City value is removed
df_v2 = df.dropna()
print("Drop all missing values")
print(df_v2, "\n")

# Drop missing from specific column
df_v3 = df.dropna(subset=['Age'])
print("Drop missing ages only")
print(df_v3, "\n")


### **Check for duplicate values**
* Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

In [None]:
# Do we have duplicated rows? (single answer)
print("Do we have duplicate rows? (returning a single answer)")
boolean_answer = df.duplicated().any()
print(boolean_answer, "\n")

# Do we have duplicated rows? (answer per row)
print("Do we have duplicate rows? (Return a series specifying if a row is a duplicate)")
df_v1 = df.duplicated()
print(df_v1, "\n")

# Do we have duplicated values in a specific column? (column: Name)
print("Do we have duplicated values in the 'Name' column?")
boolean_answer = df.Name.duplicated().any()
print(boolean_answer, "\n")

# What duplicate values does the column "Name" have?
print("What are the duplicated values?")
print(df.Name[df.Name.duplicated()], "\n")
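
By default, `duplicated()` marks only the second and later occurrences of a value. To see every row that belongs to a duplicated group, `keep=False` can be used. A small sketch on a hypothetical stand-in frame (`people` is not the `df` from above):

```python
import pandas as pd

# Hypothetical sample data with two repeated names
people = pd.DataFrame(
    [['Tom', 10], ['Jerry', 15], ['Ben', 21], ['Jerry', 22], ['Maya', None], ['Maya', 3]],
    columns=['Name', 'Age'],
)

# keep=False flags every member of a duplicated group, not only the repeats
all_copies = people[people['Name'].duplicated(keep=False)]
print(all_copies.sort_values(by='Name'), "\n")
```

This makes it easier to compare the duplicated rows side by side before deciding which one to keep.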


### **Join**
* Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

In [None]:
# Let's define an additional dataframe

cities = [
    ['Nahariyya', 'North', 58000], 
    ['Tel Aviv-Yafo', 'Center', 435900], 
    ['Beer Sheva', 'South', 205000], 
    ['Eilat', 'South', 52000],
]
  
# Create the pandas DataFrame
df2 = pd.DataFrame(cities, columns = ['City', 'Area', 'Population'])

# Look at a sample of the data
df2.head()

In [None]:
# Join each person in df with info about the city where they live

# Inner Join 
print("Inner join: return only rows that exist in both datasets")
print(df.join(df2.set_index('City'), on='City',  how='inner'), "\n")

# Left Join 
print("Left join: return all rows that exist in df, and rows in df2 if they match")
print(df.join(df2.set_index('City'), on='City',  how='left'), "\n")

# Full Outer Join 
print("Outer join: rows that exist in either dataset")
print(df.join(df2.set_index('City'), on='City',  how='outer'), "\n")
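
When a join returns fewer matches than expected, it helps to see which rows failed to match. One way to check this, sketched here with `merge` instead of `join` and with hypothetical stand-in frames (`people`, `cities`), is the `indicator` parameter:

```python
import pandas as pd

# Hypothetical sample data; note 'Tel Aviv' vs. 'Tel Aviv-Yafo'
people = pd.DataFrame(
    [['Tom', 'Tel Aviv'], ['Ben', 'Nahariyya'], ['Michal', 'Eilat']],
    columns=['Name', 'City'],
)
cities = pd.DataFrame(
    [['Nahariyya', 'North'], ['Tel Aviv-Yafo', 'Center'], ['Beer Sheva', 'South'], ['Eilat', 'South']],
    columns=['City', 'Area'],
)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
matched = people.merge(cities, on='City', how='outer', indicator=True)
print(matched[['Name', 'City', '_merge']], "\n")

# Rows that found no partner in the other dataset
unmatched = matched[matched['_merge'] != 'both']
print(unmatched[['Name', 'City', '_merge']], "\n")
```

In this sample, 'Tel Aviv' and 'Tel Aviv-Yafo' both show up as unmatched, which points to a naming mismatch between the two datasets.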


### **Group by**
* Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html


In [None]:
# How many people live in a city with a population of > 55K?

# 1. Let's edit df2's city names to align with df
print("Changing city name to enable matching between datasets:")
df2.loc[1, 'City'] = 'Tel Aviv'  # .loc avoids chained assignment, which may not update df2
print(df2, "\n")

# 2. Left join between the datasets
print("Left join df with df2")
df3 = df.join(df2.set_index('City'), on='City',  how='left') 
print(df3, "\n")

# 3. Keep rows with population > 55K in dataset
print("Keep values with population > 55K")
df3 = df3[df3['Population'] > 55000]
print(df3, "\n")


# 4. Group by city and count results
print("Group by 'City' and count results")
df3 = df3.groupby(['City']).size()
print(df3, "\n")
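
The join, filter, and group-by steps above can also be written as a single method chain. A minimal sketch, assuming small stand-in frames (`people`, `cities`) in which the city names already match, and using `merge` and `query` in place of `join` and boolean indexing:

```python
import pandas as pd

# Hypothetical sample data with matching city names
people = pd.DataFrame(
    [['Tom', 'Tel Aviv'], ['Jerry', 'Tel Aviv'], ['Ben', 'Nahariyya'],
     ['Jerry', 'Nahariyya'], ['Michal', 'Eilat']],
    columns=['Name', 'City'],
)
cities = pd.DataFrame(
    [['Nahariyya', 58000], ['Tel Aviv', 435900], ['Eilat', 52000]],
    columns=['City', 'Population'],
)

residents_per_big_city = (
    people.merge(cities, on='City', how='left')   # attach population to each person
          .query('Population > 55000')            # keep cities above the threshold
          .groupby('City')                        # one group per city
          .size()                                 # count the people in each group
)
print(residents_per_big_city, "\n")
```

The step-by-step version is easier to debug because each intermediate result is printed; the chained version is more compact once the logic is settled.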


## Transforming The Movie Datasets

In the following lesson, we'll use pandas to transform the movies dataset we explored last lesson into a more usable version.

We'll focus on several skills:
* Cleaning our data
* Asking questions about the data
* Changing the data to support analysis
* Using the data to answer our questions

**Loading the data**

Last lesson, we loaded each CSV into a separate dataframe and examined it. This time, let's be more organized and create one dictionary that stores all the dataframes.

dataframes is a dictionary:
* dataframes['movies'] will contain movies.csv
* dataframes['ratings'] will contain ratings.csv

and so on.

In [None]:
# In a notebook, a line starting with "!" runs a shell command
! git clone https://github.com/Sagi15G/de_python_course.git

In [None]:
# define a single dictionary to store all the datasets
dataframes = {}

# load each of the datasets into the dictionary
for csv_name in ['ratings', 'tags', 'movies', 'links']:
  dataframes['{}'.format(csv_name)] = pd.read_csv(filepath_or_buffer='de_python_course/data/movies_csv/{}.csv'.format(csv_name),sep=',')

# example for accessing a single df:
dataframes['movies'].head()

### Step 1: Cleaning up the dataset

We want to clean up the 'movies' data so it will be easier to use later on in our analysis. In order to do this, let's look at the data and imagine edge cases it may have. Let's frame these edge cases as questions and answer them. 

Some examples:
* Do we have missing titles? genres?
* Do we have upper/lower case issues with titles?
* Can we have multiple rows for a single movie title?
* Is each movie id unique?
* Do all movies have year information?
* Do all years make sense?
* Do we have the same movie names for different years?

Do you have additional questions you want to ask?

#### **Do we have missing titles/genres?**

In [None]:
# do we have missing values
print("Are 'title' values missing?")
print(dataframes['movies'].title.isna().any())

print("Are 'genre' values missing?")
print(dataframes['movies'].genres.isna().any())

# Can genre values have blank spaces?
print("Are there 'genre' values with blank spaces?")
dataframes['movies'][dataframes['movies']['genres'].str.contains(" ")].head()

In [None]:
# Let's rename the missing genres as an 'unknown' genre using a lambda function
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

dataframes['movies']['genres'] = dataframes['movies']['genres'].apply(lambda x: x if x != "(no genres listed)" else "unknown")

dataframes['movies'][dataframes['movies']['genres'].str.contains("unknown")].head()
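
For a plain value-for-value substitution like this one, `Series.replace` is a shorter alternative to `apply` with a lambda. A small sketch on a hypothetical stand-in series (`genres` here is not the real column):

```python
import pandas as pd

# Hypothetical sample of genre strings
genres = pd.Series(['Comedy|Drama', '(no genres listed)', 'Action'])

# Replace cells that are exactly equal to the placeholder text
cleaned = genres.replace('(no genres listed)', 'unknown')
print(cleaned, "\n")
```

`replace` with a plain string matches whole cell values only, so a genre string that merely contains the placeholder as a substring would be left as-is.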

#### **Do we have upper/lower case issues with titles?**

In [None]:
# Our titles are case sensitive - let's change that!
dataframes['movies']['title'] = dataframes['movies']['title'].str.lower()

dataframes['movies'].head()

#### **Do we have duplicate rows for a movie title?**

In [None]:
# Do we have duplicate rows for a movie title?
print("Is there a movie title with more than one row?")
dataframes['movies'].title.duplicated().any()


In [None]:
# How many times does each title appear? (most frequent first)
dataframes['movies'].title.value_counts()

In [None]:
# Let's return dataframe rows where we have duplicated titles:

titles = dataframes['movies']['title']
duplicates = dataframes['movies'][titles.isin(titles[titles.duplicated()])]
duplicates.sort_values(by=["title"])

# The difference seems to be the number of genres!

In [None]:
# Keep one row per duplicated title
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

# Keep only one row for each duplicated value using drop_duplicates
dataframes['movies'] = dataframes['movies'].drop_duplicates(subset=['title'], keep='last')

print("Is there a movie title with more than one row now?")
dataframes['movies'].title.duplicated().any()

# Ideally, we'd do something smarter, like keep the rows with more genre information. You can try it yourself...
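
One possible take on the "something smarter" idea: count the genres in each row and keep the row with the most. This is only a sketch under two assumptions: the sample rows below are made up, and genres are taken to be separated by a `|` character.

```python
import pandas as pd

# Hypothetical sample: the first two rows share a title but differ in genre detail
movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['emma (1996)', 'emma (1996)', 'heat (1995)', 'saturn 3 (1980)'],
    'genres': ['Romance', 'Comedy|Drama|Romance', 'Action|Crime', 'Sci-Fi'],
})

# Count the pipe-separated genres in each row (assumes '|' is the separator)
movies['num_genres'] = movies['genres'].str.split('|').str.len()

# Sort so the richest row of each title comes last, keep that row, restore the original order
deduplicated = (
    movies.sort_values(by=['title', 'num_genres'])
          .drop_duplicates(subset=['title'], keep='last')
          .sort_index()
)
print(deduplicated, "\n")
```

Here the 'emma (1996)' row with three genres survives and the one-genre row is dropped.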

#### **Is each movie id unique?**

In [None]:
# Is each movie id unique?
# If the most frequent movieId appears exactly once (result = 1), every id is unique
dataframes['movies']['movieId'].value_counts().max()

#### **Do all movies have year information?**

In [None]:
# we'll add the 'has_year' column with True/False values based on the title string
dataframes['movies']['has_year'] = dataframes['movies']['title'].apply(lambda x: x.split()[-1].strip("()").isnumeric())

# movies that don't have year information
dataframes['movies'][~dataframes['movies']['has_year']]

#### **Do all years make sense?**

In [None]:
# Let's add a year column to the dataset 

# Create a year column assuming title format is: <some string> (year)
# If the year string is available, use it, otherwise set year = -1
# Columns are accessed by name: positional access such as x[0] on a labeled row is deprecated in recent pandas
dataframes['movies']['year'] = dataframes['movies'][['title','has_year']].apply(lambda x: int(x['title'].split()[-1].strip("()")) if x['has_year'] else -1,  axis=1)

# Sample results:
dataframes['movies'][dataframes['movies']['has_year']].head()


In [None]:
# what's the minimal year?
has_year = dataframes['movies'][dataframes['movies']['has_year']]
has_year.sort_values(by=['year']).head()

In [None]:
# update the year calculation to only use 4-digit years
def title_contains_year(x):
  year_string = x.split()[-1].strip("()")
  return year_string.isnumeric() and len(year_string) == 4

# recalculate has_year and year (columns accessed by name rather than by position)
dataframes['movies']['has_year'] = dataframes['movies']['title'].apply(title_contains_year)
dataframes['movies']['year'] = dataframes['movies'][['title','has_year']].apply(lambda x: int(x['title'].split()[-1].strip("()")) if x['has_year'] else -1,  axis=1)

In [None]:
# what's the minimal year now?
has_year = dataframes['movies'][dataframes['movies']['has_year']]
has_year.sort_values(by=['year']).head()
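
The same year extraction can be sketched with a regular expression through `Series.str.extract`, which avoids the row-wise `apply`. Note that this variant is stricter than the code above: it requires exactly four digits inside parentheses at the end of the title. The sample titles are made up for illustration.

```python
import pandas as pd

# Hypothetical sample titles: two with a year, one without, one with a malformed year
titles = pd.Series(['toy story (1995)', 'hamlet (2000)', 'babylon 5', 'odd title (06)'])

# Capture exactly four digits inside parentheses at the end of the title
year_text = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

# Titles without a match get a missing value, which maps to has_year = False and year = -1
has_year = year_text.notna()
year = pd.to_numeric(year_text).fillna(-1).astype(int)

print(pd.DataFrame({'title': titles, 'has_year': has_year, 'year': year}), "\n")
```

The `\s*$` part tolerates trailing whitespace after the closing parenthesis.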

### Step 2: Asking questions about the data

* Do we have movies with foreign titles?
* Can one movie appear with multiple years in our data?
* What is the average rating per movie title?
* Do recent movies have more ratings on avg than older movies?
* Does positive/negative sentiment towards a movie in the 'tags' dataset translate to high/low average rating?

### Steps 3+4: Changing the data to support analysis & Using the data to answer our questions

#### **Do we have movies with foreign titles?**

In [None]:
# Do we have movies in other languages?
# Documentation: https://docs.python.org/3/library/stdtypes.html

# Define a function checking whether all characters in a string are ASCII.
# This is only a rough proxy: English titles containing accented characters are flagged too.
def isEnglish(s):
    return s.isascii()

# Define is_english column by applying the function to each title
dataframes['movies']['is_english'] = dataframes['movies']['title'].apply(isEnglish)

# Display a sample of the results
print("Non-english titles:")
dataframes['movies'][~dataframes['movies']['is_english']].head()

#### **Can one movie appear with multiple years in our data?**

In [None]:
# Create a new column containing the title name with no year
# (columns accessed by name rather than by position)
dataframes['movies']['movie_name'] = dataframes['movies'][['title','has_year']].apply(lambda x: " ".join(x['title'].split()[:-1]) if x['has_year'] else x['title'],  axis=1)

# Sample results:
dataframes['movies'][dataframes['movies']['has_year']].head()

In [None]:
# do we have the same movie names for different years?

movie_names = dataframes['movies'].groupby(['movie_name']).size() 
movie_names.loc[movie_names.values>1].sort_values(ascending=False)

In [None]:
# let's look at an example from the dataset:
dataframes['movies'][dataframes['movies']['movie_name'] == 'hamlet']

#### **What is the average rating per movie title?**

In [None]:
# We'll finally use the ratings dataset! 
# Let's validate all rows have ratings
# (each result is printed explicitly: a notebook cell only displays its last expression)
print("Are there rows with missing ratings?")
print(dataframes['ratings']['rating'].isna().any())

print("What is the min rating in the dataset?")
print(dataframes['ratings']['rating'].min())

print("What is the max rating in the dataset?")
print(dataframes['ratings']['rating'].max())


In [None]:
# get avg rating per movie
print("Get mean rating per movie")
mean_rating = dataframes['ratings'][['movieId', 'rating']].groupby(['movieId']).mean()
print(mean_rating.head(), "\n")

# round avg rating to nearest 0.1 value
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html
print("Rounding to nearest 0.1 value")
mean_rating = mean_rating.round(decimals=1)
print(mean_rating.head(), "\n")

In [None]:
# Match each movie with the avg rating and store in a new dataset

# Left Join (add 'rating' to 'movies' df)
dataframes['combined'] = dataframes['movies'].join(mean_rating, on='movieId',  how='left')

# Rename the 'rating' column to 'mean_rating'
dataframes['combined'] = dataframes['combined'].rename(columns={'rating':'mean_rating'})

# Display a sample of the data
dataframes['combined'].head()

#### **Do recent movies have more ratings on avg than older movies?**

In [None]:
# Get number of ratings per movie
print("Get number of ratings per movie")
num_ratings = dataframes['ratings'][['movieId', 'rating']].groupby(['movieId']).size()
num_ratings = num_ratings.to_frame('num_ratings')
num_ratings.head()

In [None]:
# Match each movie with the number of ratings it had

# Left Join to add number of ratings to 'movies' df
dataframes['combined'] = dataframes['combined'].join(num_ratings, on='movieId',  how='left')

# Display a sample of the data
dataframes['combined'].head()
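
The mean rating and the number of ratings were computed in two separate group-by passes. They can also be computed together with named aggregation. A sketch on a hypothetical stand-in `ratings` frame; it uses 'count', which counts non-missing ratings, rather than `size()`:

```python
import pandas as pd

# Hypothetical sample of the ratings data
ratings = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 2, 3],
    'rating': [4.0, 5.0, 3.5, 2.0, 3.0, 4.5],
})

# One pass over the groups: each keyword becomes an output column
rating_stats = ratings.groupby('movieId').agg(
    mean_rating=('rating', 'mean'),
    num_ratings=('rating', 'count'),
)
rating_stats['mean_rating'] = rating_stats['mean_rating'].round(1)
print(rating_stats, "\n")
```

The result is indexed by movieId, so it can be joined to the movies data with `on='movieId'` in the same way as above.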

In [None]:
# define movie year buckets

def year_bucket(year):
  if year == -1:
    return 'unknown_year'
  elif year < 1960:
    return '1960-'
  elif year >= 1960 and year < 1970:
    return '1960-1970'
  elif year >= 1970 and year < 1980:
    return '1970-1980'
  elif year >= 1980 and year < 1990:
    return '1980-1990'
  elif year >= 1990 and year < 2000:
    return '1990-2000'
  elif year >= 2000 and year < 2010:
    return '2000-2010'
  else:
    return '2010+'

dataframes['combined']['year_bucket'] = dataframes['combined']['year'].apply(lambda x: year_bucket(x))
dataframes['combined'].head()
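
The same bucketing can be sketched with `pd.cut`, which assigns values to bins from a list of edges. The bucket labels are copied from the function above; the sample years are made up.

```python
import pandas as pd

# Hypothetical sample years, including the -1 placeholder for "no year"
years = pd.Series([-1, 1902, 1959, 1960, 1995, 2009, 2010, 2018])

# right=False makes each bin include its left edge and exclude its right edge,
# matching comparisons such as `year >= 1960 and year < 1970`
edges = [-float('inf'), 1960, 1970, 1980, 1990, 2000, 2010, float('inf')]
labels = ['1960-', '1960-1970', '1970-1980', '1980-1990', '1990-2000', '2000-2010', '2010+']
buckets = pd.cut(years, bins=edges, labels=labels, right=False)

# pd.cut returns a categorical; convert to strings, then mark the placeholder year
buckets = buckets.astype(str).where(years != -1, 'unknown_year')
print(buckets, "\n")
```

With this approach, adding or moving a bucket means editing the two lists rather than an if/elif chain.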

In [None]:
dataframes['combined'][['year_bucket', 'num_ratings']].groupby('year_bucket').mean()

#### **Does positive/negative sentiment towards a movie in the 'tags' dataset translate to high/low average rating?**

In [None]:
# Let's take a look at the tags data
dataframes['tags'].head()

In [None]:
# Our simplistic hypothesis here would be that we can define a set of positive and negative words. 
# If a "tag" contains a negative word, it conveys negative sentiment towards the movie
# If a "tag" contains a positive word, it conveys positive sentiment towards the movie

# Define a list of words:
negative_words = ["bad", "not", "terrible", "horrible"]
positive_words = ["great", "good", "loved", "like", "fun", "funny"]

In [None]:
# Define functions to determine whether a tag contains a positive or negative word
def contains_negative_words(s):
  for word in s.split():
    if word in negative_words:
      return True
  return False 

def contains_positive_words(s):
  for word in s.split():
    if word in positive_words:
      return True
  return False 

# Creating new columns based on the functions
dataframes['tags']['negative_sentiment'] = dataframes['tags']['tag'].apply(lambda x: contains_negative_words(x)) 
dataframes['tags']['positive_sentiment'] = dataframes['tags']['tag'].apply(lambda x: contains_positive_words(x)) 


In [None]:
# Sample positive sentiment tags
print("Positive sentiment")
print(dataframes['tags'][['tag', 'positive_sentiment']][dataframes['tags']['positive_sentiment']].head(), "\n")

# Sample negative sentiment tags
print("Negative sentiment")
print(dataframes['tags'][['tag', 'negative_sentiment']][dataframes['tags']['negative_sentiment']].head(), "\n")

# Rows with positive AND negative sentiment
print("Positive AND negative sentiment")
print(dataframes['tags']['tag'][(dataframes['tags']['positive_sentiment']) & (dataframes['tags']['negative_sentiment'])], "\n")

In [None]:
# let's mark confused rows appropriately
# villain nonexistent or not needed for good story --> positive sentiment
dataframes['tags'].iloc[2496, 4] = False
dataframes['tags'].iloc[2496, :]

In [None]:
# not funny --> negative sentiment
dataframes['tags'].iloc[2563, 5] = False
dataframes['tags'].iloc[2563, :]
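
Hard-coded row and column positions break as soon as the file is re-sorted or a column is added. A more robust option is to select the row by its content and the column by its name. A sketch on a hypothetical stand-in `tags` frame:

```python
import pandas as pd

# Hypothetical sample of the tags data after the sentiment columns were added
tags = pd.DataFrame({
    'tag': ['not funny', 'great soundtrack', 'bad acting'],
    'negative_sentiment': [True, False, True],
    'positive_sentiment': [True, True, False],
})

# Select the row by its content rather than by its position in the file
is_not_funny = tags['tag'] == 'not funny'
tags.loc[is_not_funny, 'positive_sentiment'] = False
print(tags, "\n")
```

If the same tag text appears on several rows, this corrects all of them at once.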

In [None]:
# How many tags with positive/negative sentiment does the movie have?

# Defining a dataset with positive tags only
print("Movies with positive tags:")
positive_tags =  dataframes['tags'][dataframes['tags']['positive_sentiment']]
# Grouping by movie and counting
num_positive_tags = positive_tags.groupby(['movieId']).size()
num_positive_tags = num_positive_tags.to_frame('num_positive_tags')
print(num_positive_tags.head(), "\n")


# Defining a dataset with negative tags only
print("Movies with negative tags:")
negative_tags =  dataframes['tags'][dataframes['tags']['negative_sentiment']]
# Grouping by movie and counting
num_negative_tags = negative_tags.groupby(['movieId']).size()
num_negative_tags = num_negative_tags.to_frame('num_negative_tags')
print(num_negative_tags.head(), "\n")


In [None]:
# Left Join to add positive/negative sentiment to combined data
dataframes['combined'] = dataframes['combined'].join(num_positive_tags, on='movieId',  how='left')
dataframes['combined'] = dataframes['combined'].join(num_negative_tags, on='movieId',  how='left')

In [None]:
# Look at sample data
cols = ['title', 'year', 'mean_rating', 'num_positive_tags', 'num_negative_tags']
dataframes['combined'][cols][(dataframes['combined']['num_positive_tags'] > 0) | (dataframes['combined']['num_negative_tags'] > 0) ].head()

In [None]:
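# Fill missing tag counts with 0 (a movie with no matching tags has a count of zero).
# Only the count columns are filled, so movies without ratings keep a missing mean_rating
# instead of a misleading 0.
tag_count_cols = ['num_positive_tags', 'num_negative_tags']
dataframes['combined'][tag_count_cols] = dataframes['combined'][tag_count_cols].fillna(0)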

# Look at sample data again
dataframes['combined'][cols][(dataframes['combined']['num_positive_tags'] > 0) | (dataframes['combined']['num_negative_tags'] > 0) ].head()

In [None]:
# If a movie had positive sentiment, did it rate high?

# take movies with positive tags and no negative tags
positives = dataframes['combined'][(dataframes['combined']['num_positive_tags'] > 0) & (dataframes['combined']['num_negative_tags'] == 0)]
positives['mean_rating'].describe()

In [None]:
# if a movie had negative sentiment, did it rate low?

# take movies with negative tags and no positive tags
negatives = dataframes['combined'][(dataframes['combined']['num_negative_tags'] > 0) & (dataframes['combined']['num_positive_tags'] == 0)]
negatives['mean_rating'].describe()
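
To compare the two groups directly, the two `describe()` results can be placed side by side in one table. A sketch with made-up numbers in a stand-in `combined` frame; the values are illustrative only and say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical sample of the combined data
combined = pd.DataFrame({
    'mean_rating': [4.1, 3.8, 2.5, 3.0, 3.9],
    'num_positive_tags': [2, 1, 0, 0, 1],
    'num_negative_tags': [0, 0, 1, 3, 1],
})

# Movies with only positive tags, and movies with only negative tags
only_positive = (combined['num_positive_tags'] > 0) & (combined['num_negative_tags'] == 0)
only_negative = (combined['num_negative_tags'] > 0) & (combined['num_positive_tags'] == 0)

# One column of summary statistics per group
summary = pd.DataFrame({
    'positive_only': combined.loc[only_positive, 'mean_rating'].describe(),
    'negative_only': combined.loc[only_negative, 'mean_rating'].describe(),
})
print(summary, "\n")
```

Checking the 'count' row first is worthwhile: with only a handful of movies in a group, differences in the mean are weak evidence either way.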