---
---
Problem Set 5: Pandas II (Data Manipulation)

Applied Data Science using Python

New York University, Abu Dhabi

Out: 5th Oct 2023 || **Due: 17th Oct 2023 at 23:59**

---
---
# Start Here
## Learning Goals
### General Goals
- Understand Data Manipulation using Pandas
- Bring together all the different concepts learned in the class
- Understand the challenges of data cleaning

### Specific Goals
- Understand Merging and Concatenating
- Be comfortable with multi-level indexing
- Understand how to save Pandas DataFrames as CSV files
- Be able to effectively use various Pandas functions such as
  - `replace()`
  - `extract()`
  - `apply()`
  - `merge()`
  - `concat()`
  - `groupby()`
  - `transform()`
  - `agg()`
  - `filter()`

## Collaboration Policy
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses or results; you must submit your own code and results. All submitted code will be compared against all code submitted this and previous semesters and online using MOSS. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be mature and responsible enough to finish your work with full integrity.
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html). Violations may result in penalties, such as failure in a particular assignment.

## Late Submission Policy
You can submit the homework up to 3 days late. However, we will deduct **20 points** from your homework grade **for each late day you take**. We will not accept the homework after 3 late days.

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. We ask students **not** to distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. We also ask that you do not post these problem sets and recitations online or on any public platform.

## Disclaimer
The number of points does not necessarily reflect the difficulty of a task.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **P5_YOUR NETID.ipynb**.

---




# General Instructions
This homework is worth 100 points. It has 3 parts. Below each part, we provide a set of concepts required to complete that part. All the parts need to be completed in this Jupyter (Colab) Notebook.



# Exploring the Project Data-verse

In your recitation, you used your Pandas knowledge on data manipulation to explore the **Suicide Rates Dataset**. In this homework, you will explore 3 more datasets using your *data wrangling* skills: **(i) Health Trends Dataset**, **(ii) MovieLens Dataset**, and **(iii) Coronavirus Pandemic (COVID-19) Dataset**.

The recitation and the homework for this week are meant to help you choose the problem and the dataset for your project. By the end of this homework, you will know the ins and outs of the three project datasets (Suicide Rates, Health Trends, MovieLens) well enough to choose one and formulate an interesting hypothesis.

# Part I: Health Trends Dataset: README (35 points)

Typically when a dataset is released to the public, it comes with a **README** file. A README file contains the description, and detailed information about the different folders, files, and fields in the dataset, along with the information about licenses, credits, and citations. It may also contain information on how the data was collected, how many subjects were involved, and so on. A README file is an important part of the dataset as it documents its motivation, composition, collection process, recommended uses, and so on. Furthermore, it facilitates better communication between dataset creators and dataset consumers.

In this part of the homework, you have been given a messy dataset with a clean README file. There is a discrepancy between the dataset we have provided you, and the attached README file as the README file **does not** describe the data. Instead, it describes the *cleaner* version of the dataset that we have hidden from you, and we have no intention of providing that to you. :)

Instead, we want you to manipulate the messy dataset, and clean it so that it adheres to the README file provided. What this means is merging and/or concatenating the different files, removing redundant or unnecessary fields, dealing with NaNs, defining columns properly, sorting the data, and validating that the final dataset adheres to the README file completely.

The messy dataset we have provided corresponds to the **"Health Trends Dataset"**. This dataset is collected from two sources: Google search data from **Google Trends**, and official health statistics from **CDC/BRFSS**. The dataset is for two outcome variables: **obesity** and **exercise**.

There are 3 folders inside **health_trends/** directory that we provide you:

1. **health_statistics/:** This directory contains 6 sub-directories, 3 corresponding to "exercise", and 3 corresponding to "obesity".
  - **exercise_age/:** This directory contains 15 files, 1 for each year from 2004 to 2018. Each file contains the state-wise exercise related statistics for the U.S *stratified by age group*.
  - **exercise_gender/:** This directory contains 15 files, 1 for each year from 2004 to 2018. Each file contains the state-wise exercise related statistics for the U.S *stratified by gender*.
  - **exercise_overall/:** This directory contains 15 files, 1 for each year from 2004 to 2018. Each file contains the *overall* state-wise exercise related statistics for the U.S.
  - **obesity_age/:** This directory contains 15 files, 1 for each year from 2004 to 2018. Each file contains the state-wise obesity related statistics for the U.S *stratified by age group*.
  - **obesity_gender/:** This directory contains 15 files, 1 for each year from 2004 to 2018. Each file contains the state-wise obesity related statistics for the U.S *stratified by gender*.
  - **obesity_overall/:** This directory contains 15 files, 1 for each year from 2004 to 2018. Each file contains the *overall* state-wise obesity related statistics for the U.S.

2. **spatial_search_intensity/:** This directory contains 1215 files, 1 for each (year, keywords) pair. Since there are 81 search keywords, and 15 years, this is equal to 1215 files. Each file is named in the **\<year\>_spatial_\<keyword\>.csv** format.

3. **temporal_search_intensity/:** This directory contains 81 files, 1 for each keyword. Each file is named in the **2004_2018_temporal_\<keyword\>.csv** format.
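
Because the year and the keyword are only encoded in the file names, it can help to parse them before reading any data. The following is a minimal sketch, assuming the two naming formats described above; the example keywords (`weight loss`, `gym`) and the helper names are hypothetical and only serve as an illustration.

```python
import re

# Hypothetical example paths that follow the naming formats described above.
example_paths = [
    "spatial_search_intensity/2004_spatial_weight loss.csv",
    "temporal_search_intensity/2004_2018_temporal_gym.csv",
]

# <year>_spatial_<keyword>.csv
SPATIAL_PATTERN = re.compile(r"^(?P<year>\d{4})_spatial_(?P<keyword>.+)\.csv$")
# 2004_2018_temporal_<keyword>.csv
TEMPORAL_PATTERN = re.compile(r"^2004_2018_temporal_(?P<keyword>.+)\.csv$")


def parse_spatial_name(path):
    """Return (year, keyword) for a spatial file path, or None if the name does not match."""
    file_name = path.split("/")[-1]
    match = SPATIAL_PATTERN.match(file_name)
    if match is None:
        return None
    return int(match.group("year")), match.group("keyword")


def parse_temporal_name(path):
    """Return the keyword for a temporal file path, or None if the name does not match."""
    file_name = path.split("/")[-1]
    match = TEMPORAL_PATTERN.match(file_name)
    return None if match is None else match.group("keyword")


spatial_info = parse_spatial_name(example_paths[0])
temporal_keyword = parse_temporal_name(example_paths[1])
print(spatial_info, temporal_keyword)
```

Matching the whole file name, rather than searching for the year anywhere in the path, avoids accidental matches when a keyword happens to contain digits.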

In total, you have **1386** files that you need to clean and synthesize into just **3** clean output files.

We have also provided you with the paths to the files in each of the 3 directories above in **stats_paths.txt**, **spatial_paths.txt**, **temporal_paths.txt**. This is so that you can read the csv files within each directory using these path files.

But wait! We told you about the directory structure of the data provided. What is the specification of the output files? What is this data about? What is temporal data? Spatial data? What do we have to do exactly? I am so confused!

Well, to reiterate, you have to convert the given data into 3 files only, such that each file adheres completely to the provided README file. Now, to understand the data, and be able to actually do this part, you will have to first read the README file inside the **health_trends/** directory.

Don't get intimidated by this part. It may look like a tedious task. However, once you actually understand what needs to be done, and get a sense of the dataset, it is not really that hard.

Tips: Once you have read this part including the prompt, go back and open at least one csv file from each directory and sub-directory, and get a feel of what the data looks like. Then read the README file and identify the logic for each output file -- one at a time.

![readme](https://drive.google.com/uc?id=1XeplvYB0L82k0i4RpAjMvbn6Tnb9CJ-X)

## Prompt

More concretely, write a function `clean(spatial_paths, temporal_paths, stats_paths)` that takes in 3 lists of paths, and creates and saves three output files to the output directory. The output directory is called `preprocessed_data/` and the three output files should be named as `spatial_trends.csv`, `temporal_trends.csv` and `health_stats.csv`, each adhering to the fields and specifications provided in the README file.
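
As a rough illustration of the saving step, the sketch below writes a toy DataFrame to a CSV file and reads it back. The data values are made up, and a temporary directory stands in for `preprocessed_data/` so that the example does not touch your real output folder.

```python
import os
import tempfile

import pandas as pd

# Toy data with made-up values; the real columns are defined by the README.
toy = pd.DataFrame(
    {"geoName": ["Alabama", "Alaska"], "year": [2004, 2004], "gym": [55, 60]}
)

with tempfile.TemporaryDirectory() as scratch_dir:
    # Stand-in for the real output directory.
    out_dir = os.path.join(scratch_dir, "preprocessed_data")
    # to_csv() does not create missing directories, so create it first.
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "spatial_trends.csv")
    # index=False keeps the row index out of the file, so no extra column appears.
    toy.to_csv(out_path, index=False)

    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

Passing `index=False` matters here: without it, the saved file gains an unnamed index column that is not part of the README specification.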

Notes:
- You should read the README file before starting to code.
- You should simultaneously inspect all the different types of files in the provided **health\_trends** folder to understand what these files look like.
- You must remove the rows which are empty and uninformative. These would be the empty rows at the end of most of the csv files.
- Finally, you should visually inspect to validate that the final dataframes and the output csv files have the required columns and rows sorted as described in the README, and have the required shape. There should be no redundant/extra column(s) or row(s) in the output files, and the column names should match the fields described in the README file.
- You should submit the three generated output files with your notebook. We will also use your code to generate these datasets on our own.
- Because there are many files to process, your program will take some time (in minutes) to complete its execution. This is primarily because of the `read_csv()` function which is an [I/O bound](https://en.wikipedia.org/wiki/I/O_bound) process.
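
The note about empty rows can be illustrated with a small sketch. It assumes that the uninformative rows are rows in which every field is empty, which is one common shape for trailing rows in exported CSV files; the in-memory CSV below is made up for the example.

```python
import io

import pandas as pd

# A made-up CSV with two trailing rows that contain only separators.
raw_csv = io.StringIO("geoName,gym\nAlabama,55\nAlaska,60\n,\n,\n")

frame = pd.read_csv(raw_csv)

# how="all" drops a row only when every column in it is NaN,
# so partially filled rows are kept.
cleaned = frame.dropna(how="all").reset_index(drop=True)

print(len(frame), len(cleaned))
```

If the uninformative rows in a real file still carry a value in some column, filtering on a required identifier column with `notna()` may be a better fit than `dropna(how="all")`.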


In [101]:
# from google.colab import drive
# drive.mount('/content/drive')

# Commenting these lines out since I run the Jupyter notebook locally

In [102]:
import pandas as pd

"""
You can define the input directory as per your directory structure.
We will define it to the directory that contains the health_trends folder.
"""

input_directory = "health_trends/"

temporal_paths = list(map(lambda e:e.rstrip(),
                          open(input_directory+"temporal_paths.txt").readlines()))

spatial_paths = list(map(lambda e:e.rstrip(),
                         open(input_directory+"spatial_paths.txt").readlines()))

stats_paths = list(map(lambda e:e.rstrip(),
                       open(input_directory+"stats_paths.txt").readlines()))

In [103]:
stats_paths

['health_statistics/obesity_age/2008.csv',
 'health_statistics/obesity_age/2009.csv',
 'health_statistics/obesity_age/2018.csv',
 'health_statistics/obesity_age/2015.csv',
 'health_statistics/obesity_age/2014.csv',
 'health_statistics/obesity_age/2016.csv',
 'health_statistics/obesity_age/2017.csv',
 'health_statistics/obesity_age/2013.csv',
 'health_statistics/obesity_age/2007.csv',
 'health_statistics/obesity_age/2006.csv',
 'health_statistics/obesity_age/2012.csv',
 'health_statistics/obesity_age/2004.csv',
 'health_statistics/obesity_age/2010.csv',
 'health_statistics/obesity_age/2011.csv',
 'health_statistics/obesity_age/2005.csv',
 'health_statistics/exercise_gender/2008.csv',
 'health_statistics/exercise_gender/2009.csv',
 'health_statistics/exercise_gender/2018.csv',
 'health_statistics/exercise_gender/2015.csv',
 'health_statistics/exercise_gender/2014.csv',
 'health_statistics/exercise_gender/2016.csv',
 'health_statistics/exercise_gender/2017.csv',
 'health_statistics/exerci

In [104]:
import os

# Don't change the output directory and save your output files to this folder
output_directory = "preprocessed_data/"

def clean(spatial_paths, temporal_paths, stats_paths):
  # Write your solution here
  ############# SOLUTION ###############
  # Create the output directory if it is missing so that the to_csv() calls below cannot fail
  os.makedirs(output_directory, exist_ok=True)

  # --- Temporal Data ---
  
  # Read first file and add it to the dataframe
  temporal_df = pd.read_csv(input_directory + temporal_paths[0])
  
  # Pre-emptively convert the 'date' column to datetime.
  # This prevents an error that occurred when parsing dates from all files and converting them afterwards
  temporal_df['date'] = pd.to_datetime(temporal_df['date'], format='%Y-%m-%d')
  
  # Drop the isPartial column since it is not required
  temporal_df = temporal_df.drop(columns=['isPartial'])

  # Go through all other files and merge them into our dataframe
  for path in temporal_paths[1:]:
    df = pd.read_csv(input_directory + path)
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df = df.drop(columns=['isPartial'], errors='ignore')
    temporal_df = pd.merge(temporal_df, df, on='date', how='outer')

  temporal_df.to_csv(output_directory + 'temporal_trends.csv', index=False)
  
  print(f'Shape of temporal_df: {temporal_df.shape}') # Expected: (180, 82)
  # --- Temporal Data ---
  
  # --- Spatial Data ---
  final_spatial_df = pd.DataFrame()  # This will hold all the years' data

  years = list(range(2004, 2019))
  for year in years:
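    # Filter files for the current year by matching the '<year>_spatial_' file-name prefix,
    # so that digits inside a keyword can never be mistaken for the year
    current_year_files = [f for f in spatial_paths if f.split('/')[-1].startswith(f"{year}_spatial_")]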
        
    # Start with the first file for the year
    base_file = current_year_files[0]
    base_df = pd.read_csv(input_directory + base_file)
    base_df['year'] = year  # Add the year column
        
    # Iterate through all other files for the year and merge them
    for file in current_year_files[1:]:
      current_df = pd.read_csv(input_directory + file)
            
      # Extract keyword from filename (removing the 'year_spatial_' prefix and '.csv' suffix)
      keyword = file.split('/')[-1].replace(f"{year}_spatial_", "").replace(".csv", "")
            
      # Rename the second column to the keyword
      current_df.columns = ["geoName", keyword]
            
      # Merge with base dataframe
      base_df = pd.merge(base_df, current_df, on='geoName', how='left')
            
    # Append the data for this year to the final dataframe
    final_spatial_df = pd.concat([final_spatial_df, base_df], ignore_index=True)

  # Sort the dataframe
  final_spatial_df = final_spatial_df.sort_values(by=['year', 'geoName'])
  
  # Reordering columns
  column_order = ["geoName", "year"]
  sorted_keywords = sorted([col for col in final_spatial_df.columns if col not in column_order])
  column_order.extend(sorted_keywords)
  final_spatial_df = final_spatial_df[column_order]

  # Save the consolidated dataframe to the specified CSV file
  final_spatial_df.to_csv(output_directory + 'spatial_trends.csv', index=False)
  
  print(f'Shape of spatial_df: {final_spatial_df.shape}') # Expected: (765, 83)
  # --- Spatial Data ---
  
  # --- Health Statistics Data ---
  # Go through all paths in the file and merge them into a single dataframe
  health_stats_df = pd.DataFrame()
  for path in stats_paths:
    df = pd.read_csv(input_directory + path)
    variable = path.split('/')[1].split('_')[0]
    df['variable'] = variable
    health_stats_df = pd.concat([health_stats_df, df], ignore_index=True)
  
  # Clean up column names in health_stats_df by making them lowercase and removing spaces
  health_stats_df.columns = [col.lower().strip() for col in health_stats_df.columns]
  
  # List of columns to keep
  columns_to_keep = [
    "id",
    "year",
    "locationabbr",
    "locationdesc",
    "data_value",
    "low_confidence_limit",
    "high_confidence_limit",
    "sample_size",
    "stratification",
    "stratificationtype",
    "variable"
  ]
  
  health_stats_df = health_stats_df[columns_to_keep]
  
  # Filter out uninformative rows based on 'id' column
  health_stats_df = health_stats_df[health_stats_df['id'].notna()]
  
  # Sort the rows by “variable”, then “year”, and then “id” in the ascending order
  health_stats_df = health_stats_df.sort_values(by=["variable", "year", "id"])
    
  health_stats_df.to_csv(output_directory + 'health_stats.csv', index=False)

  print(f'Shape of health_stats_df: {health_stats_df.shape}') # Expected: (14442, 11)
  
  # --- Health Statistics Data ---
  
  ############# SOLUTION END ###############

In [105]:
# How we will call your function

clean(spatial_paths, temporal_paths, stats_paths)

Shape of temporal_df: (180, 82)
Shape of spatial_df: (765, 83)
Shape of health_stats_df: (14442, 11)


## *Concepts required to complete this task*

*   Data Merging
*   Data Concatenation
*   Dataframes Indexing, Sorting, Method Chaining, and Manipulation
*   Reading and Writing CSV files as Dataframes and vice-versa.
*   (Optionally) `reduce()` function from `functools`
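
The optional `reduce()` idea can be sketched as follows: instead of merging inside an explicit loop, a list of DataFrames is folded into one by repeatedly merging on a shared key. The three toy frames and their values are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Made-up frames that share a "date" column and carry one keyword column each.
frames = [
    pd.DataFrame({"date": ["2004-01-01", "2004-02-01"], "gym": [40, 42]}),
    pd.DataFrame({"date": ["2004-01-01", "2004-02-01"], "diet": [70, 65]}),
    pd.DataFrame({"date": ["2004-01-01", "2004-02-01"], "yoga": [10, 12]}),
]

# reduce() applies the merge pairwise: ((frames[0] + frames[1]) + frames[2]) ...
merged = reduce(
    lambda left, right: pd.merge(left, right, on="date", how="outer"),
    frames,
)

print(merged.columns.tolist())
```

The result has one row per date and one column per keyword, which is the same shape a loop of successive `pd.merge()` calls would produce.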



## Rubric

- +8 points for correctness of the `temporal_trends.csv` file (as per the specifications of the README file, and by using correct pandas techniques)
- +8 points for correctness of the `spatial_trends.csv` file (as per the specifications of the README file, and by using pandas)
- +14 points for correctness of the `health_stats.csv` file (as per the specifications of the README file, and by using pandas)
- +5 points for proper comments and variable names.

# Part II: MovieLens Dataset: Popular Movies and Biases (35 points)

In this part, you will be manipulating the **MovieLens Dataset**. Before moving on to the prompts for this section, please read the **README.txt** file provided with the dataset. Also visually inspect the different files in the dataset along with the fields/columns in each file to get a sense of what the data looks like.


## A. Most Popular Movies (15 points)

Naturally, with the movie ratings dataset, we would like to first know which movies were rated the highest, and what were their genres. To be able to do that, we would like to

1. Create a DataFrame with the following six columns:
  *   **movie_id**: this is the unique identifier of the movie as provided in the dataset
  *   **movie_title**: this is the title of the movie as provided in the dataset
  *   **release_date**: this is the date of release as provided in the dataset
  *   **genre(s)**: this is the genre of the movie or combination of genres. This is not a 0 or 1 value, but the actual name of the genre.

      *Note: If the movie has more than one genre, those genres should be joined with "and" to create a new composite genre. For example, if the movie genres were Comedy, Drama and Action, you would need to combine them into a new genre called "Action and Comedy and Drama". You should combine them in alphabetical (or some other fixed) order so that grouping by genre later is easy and meaningful. In other words, you would not want both "Action and Comedy and Drama" and "Comedy and Action and Drama", as they are practically the same group.*
  *   **average_rating**: this is the average rating of the movie computed by averaging all the ratings given
  *   **number_of_raters**: this is the number of raters for each movie

2. We would then like to remove movies whose only genre is `unknown`, movies with fewer than 50 raters, and movies with an average rating of less than 3.

3. Finally, we would like to sort the DataFrame in the descending order by `average_rating` to view the top rated movies.

Your solution should make use of Pandas functions wherever necessary, and use no loops at all.

Your code should not be more than 15 lines/statements of code. Our reference solution is 5 lines of code, aside from the 4 lines of reading files.
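
To make the composite-genre idea concrete, here is a minimal sketch on toy data. It assumes that the genre columns are 0/1 indicator columns listed in a fixed (alphabetical) order; the movies, ratings, and the three genre names are made up, and this is only one loop-free way to do it, not necessarily the reference solution.

```python
import pandas as pd

# Made-up movies with 0/1 genre indicator columns in alphabetical order.
genre_columns = ["Action", "Comedy", "Drama"]
movies = pd.DataFrame(
    {
        "movie_id": [1, 2, 3],
        "Action": [1, 0, 0],
        "Comedy": [1, 1, 0],
        "Drama": [1, 0, 1],
    }
)
# Made-up ratings: one row per (movie, rater) pair.
ratings = pd.DataFrame(
    {"movie_id": [1, 1, 2, 3, 3, 3], "rating": [5, 4, 3, 2, 4, 3]}
)

# For each row, keep the names of the flagged genres and join them with " and ".
flags = movies[genre_columns].astype(bool)
movies["genre(s)"] = flags.apply(
    lambda row: " and ".join(row[row].index), axis=1
)

# Named aggregation gives both summary columns in one groupby.
summary = (
    ratings.groupby("movie_id")
    .agg(average_rating=("rating", "mean"), number_of_raters=("rating", "count"))
    .reset_index()
)

combined = movies[["movie_id", "genre(s)"]].merge(summary, on="movie_id")
print(combined)
```

Because the genre columns are visited in a fixed order, the same set of genres always produces the same composite label, which is what makes later grouping by genre meaningful.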


In [106]:
# Write your solution here

############# SOLUTION ###############
# Read the CSV files
users = pd.read_csv('movielens/user.csv', sep='\t', header=None, names=['user_id', 'age', 'gender', 'occupation', 'zip_code'], encoding="ISO-8859-1")
movies = pd.read_csv('movielens/movie.csv', sep='\t', header=None, names=['movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb_URL'] + list(pd.read_csv('movielens/genre.csv', sep='\t', header=None)[0].values), encoding="ISO-8859-1")
data = pd.read_csv('movielens/data.csv', sep='\t', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'], encoding="ISO-8859-1")

# Create a DataFrame with desired columns
movie_ratings = data.groupby('movie_id').agg(average_rating=('rating', 'mean'), number_of_raters=('rating', 'count')).reset_index()
merged = movies.merge(movie_ratings, on='movie_id', how='left')
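# Build the composite genre by concatenating the names of the flagged genre columns.
# The trailing ' and ' is removed with an anchored regex, because str.rstrip(' and ') strips a set of
# characters rather than a suffix and would truncate names such as "Animation" or "Western".
merged['genre(s)'] = merged.drop(columns=['movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb_URL', 'average_rating', 'number_of_raters']).dot(pd.read_csv('movielens/genre.csv', sep='\t', header=None).set_index(0).index + ' and ').str.replace(r' and $', '', regex=True)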

# Filter and sort the dataframe
result = merged[['movie_id', 'movie_title', 'release_date', 'genre(s)', 'average_rating', 'number_of_raters']]
result = result[(result['genre(s)'] != 'unknown') & (result['number_of_raters'] >= 50) & (result['average_rating'] >= 3)].sort_values(by='average_rating', ascending=False)

result


############# SOLUTION END ############

|  | movie_id | movie_title | release_date | genre(s) | average_rating | number_of_raters |
| --- | --- | --- | --- | --- | --- | --- |
| 407 | 408 | Close Shave, A (1995) | 28-Apr-1996 | Animation and Comedy and Thriller | 4.491071 | 112 |
| 317 | 318 | Schindler's List (1993) | 01-Jan-1993 | Drama and War | 4.466443 | 298 |
| 168 | 169 | Wrong Trousers, The (1993) | 01-Jan-1993 | Animation and Comedy | 4.466102 | 118 |
| 482 | 483 | Casablanca (1942) | 01-Jan-1942 | Drama and Romance and War | 4.456790 | 243 |
| 113 | 114 | Wallace & Gromit: The Best of Aardman Animatio... | 05-Apr-1996 | Animation | 4.447761 | 67 |
| ... | ... | ... | ... | ... | ... | ... |
| 701 | 702 | Barcelona (1994) | 01-Jan-1994 | Comedy and Romance | 3.018868 | 53 |
| 475 | 476 | First Wives Club, The (1996) | 14-Sep-1996 | Comedy | 3.018750 | 160 |
| 1047 | 1048 | She's the One (1996) | 23-Aug-1996 | Comedy and Romance | 3.013699 | 73 |
| 37 | 38 | Net, The (1995) | 01-Jan-1995 | Sci-Fi and Thriller | 3.008333 | 120 |


### *Concepts required to complete this task*

*   Reading CSV files using Pandas
*   Method Chaining
*   Merging and Concatenation
*   (Boolean) Indexing
*   Data Manipulation Using Pandas
*   Replacing
*   Applying
*   Aggregating

### Rubric

- +10 points for correctness using Pandas library
- +3 points for conciseness
- +2 points for proper comments and variable names

## B. Does gender affect rating? (20 points)

People perceive things differently based on their identity, culture, age, gender, and values. In [some studies](https://en.wikipedia.org/wiki/Complimentary_language_and_gender), women are known to give and receive more compliments than men. In terms of culture, [Germans are stereotypically perceived as stiff and more critical](https://en.wikipedia.org/wiki/Stereotypes_of_Germans). Furthermore, there may be conceptual differences: a rating of "3" on a particular movie may mean just "OK" for one person, and may mean "Good" for another. Some people may just be highly optimistic/positive, and never assign a rating of less than "4". Some people may just hate certain genres, and like others. Notice how in such cases ratings do not depend on the movie (which is the main goal) but on the characteristics/traits of the users.

In data, these issues are sometimes known as *annotator biases* or *rater biases*, and the characteristics of people that define these biases are known as *covariates*. In an ideal case, (movie) ratings should be independent of these biases. But that is rarely the case. Characterizing and understanding these differences is a challenging problem. However, we generally like to account for these differences and control for them: accounting/normalizing for them when creating a machine learning model *(as we will see later in the course)*, and controlling for them when studying relationships between variables *(as you may learn if you take a causal inference/computational social science course)*.
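
As a small illustration of what "normalizing" for a rater's personal scale could look like, the sketch below mean-centers each user's ratings with `groupby().transform()`. Mean-centering is just one simple choice assumed here for illustration; it is not part of this assignment, and the toy ratings are made up.

```python
import pandas as pd

# Made-up ratings: user 1 rates generously, user 2 rates harshly.
ratings = pd.DataFrame(
    {"user_id": [1, 1, 1, 2, 2, 2], "rating": [5, 4, 5, 2, 1, 3]}
)

# transform("mean") returns one value per original row (the user's own mean),
# so it can be subtracted from the rating column directly.
user_mean = ratings.groupby("user_id")["rating"].transform("mean")
ratings["centered_rating"] = ratings["rating"] - user_mean

print(ratings)
```

After centering, a positive value means "above this user's usual rating" regardless of whether the user tends to rate high or low.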

Age and gender are classic covariates. That is the reason why you would notice that most datasets like *MovieLens* tend to provide information about their raters'/subjects' gender, age, location, and so on. Controlling for such *covariates/confounders* is beyond the scope of this course.

That said, in this task, we would like to do two very simple experiments to just inspect if there are biases introduced by gender in the dataset:

(i) For each **movie title**, we would like to get mean ratings per gender; and then compute the absolute difference in mean ratings. While this by itself is not enough, if by visual inspection most movies have higher difference in mean ratings, that could signal towards gender bias;

(ii) Regardless of the movie, for each **genre** (except the `unknown` genre), we would also like to compute the mean ratings per gender, and then compute the absolute difference in mean ratings.

But how would you group movies by genre when one movie can have multiple genres, and especially a **different number** of genres? Well, that is exactly why we made you do Part A: grouping the different combinations of genres creates new composite genres :-).

In the end, we want two DataFrames:

(i) one named `df_movie_gender` with the following columns:

  *   **movie_title**: this is the title of the movie
  *   **difference in average ratings**: this is the absolute difference in average ratings of male and female raters for each movie

(ii) second named `df_genre_gender` with the following columns:

  *   **genre**: this is the name of the genre (or a composite genre)
  *   **difference in average ratings**: this is the absolute difference in average ratings of male and female raters for each genre

Both of your dataframes should be sorted by the column **difference in average ratings** in descending order.

Note: Just by doing the above, you may not be able to visually tell if there is a significant difference in terms of ratings of different genders; for that you will have to do significance testing, which is not part of this assignment, but something you will learn soon. For now, we basically mean to compute the differences in the average ratings of male versus female for different movies, and for different genres to visually inspect if we consistently see large differences or not.
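
The computation described above can be sketched on a toy table. The titles and ratings are made up; the sketch assumes one row per rating with `movie_title`, `gender` (`M`/`F`), and `rating` columns.

```python
import pandas as pd

# Made-up ratings: one row per (movie, rater) pair.
toy = pd.DataFrame(
    {
        "movie_title": ["A", "A", "A", "B", "B", "B"],
        "gender": ["M", "F", "M", "M", "F", "F"],
        "rating": [4, 2, 5, 3, 3, 5],
    }
)

# Mean rating per (movie, gender), then pivot gender into columns F and M.
means = toy.groupby(["movie_title", "gender"])["rating"].mean().unstack()

# Absolute difference between the male and female means.
means["difference in average ratings"] = (means["M"] - means["F"]).abs()

diff = (
    means.reset_index()[["movie_title", "difference in average ratings"]]
    .sort_values(by="difference in average ratings", ascending=False)
)
print(diff)
```

The same pattern applies to genres by grouping on the composite genre column instead of `movie_title`. A movie rated by only one gender would yield `NaN` here, since one of the two means is missing.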

### Quick Note:

Based on a discussion with Prof. Bedoor, I have used the `result` dataframe from part A to form the `df_movie_gender` and `df_genre_gender` dataframes. While we can use the csv files provided to form the required dataframes, I believe that the solution remains considerably more concise and readable if we use the `result` dataframe from part A, especially while creating the `df_genre_gender` dataframe where a given movie can belong to multiple genres.

In [107]:
# Write your solution here
df_movie_gender = None
############# SOLUTION FOR df_movie_gender ###############
# Merge results from the previous part with users dataframe
merged = result.merge(data, on='movie_id', how='left').merge(users, on='user_id', how='left')
merged = merged.drop(columns=['timestamp', 'zip_code', 'occupation', 'age', 'average_rating'])

# Group movies by title and gender and calculate the average rating
avg_ratings = merged.groupby(['movie_title', 'gender'])['rating'].mean().unstack()

# Use the gender-wise average ratings to calculate the difference in average ratings
avg_ratings['difference in average ratings'] = (avg_ratings['M'] - avg_ratings['F']).abs()
df_movie_gender = avg_ratings.reset_index()[['movie_title', 'difference in average ratings']]

# Sort the dataframe by the difference in average ratings in descending order
df_movie_gender = df_movie_gender.sort_values(by='difference in average ratings', ascending=False)

df_movie_gender
############# SOLUTION END ############

| gender | movie_title | difference in average ratings |
| --- | --- | --- |
| 379 | Ran (1985) | 1.107143 |
| 257 | Jane Eyre (1996) | 1.067368 |
| 363 | Postman, The (1997) | 1.066924 |
| 177 | First Knight (1995) | 1.036577 |
| 111 | Cook the Thief His Wife & Her Lover, The (1989) | 0.927363 |
| ... | ... | ... |
| 179 | Fish Called Wanda, A (1988) | 0.001099 |
| 95 | Circle of Friends (1995) | 0.000693 |
| 176 | Firm, The (1993) | 0.000483 |
| 391 | Return of the Jedi (1983) | 0.000232 |


In [108]:
# Write your solution here
df_genre_gender = None
############# SOLUTION FOR df_genre_gender ###############
# Each composite genre from Part A (e.g. "Comedy and Romance") is treated as its own group,
# so the ratings can be grouped by genre and gender directly.
# explode() is not needed: it only splits list-like values and leaves genre strings unchanged.
avg_ratings_genre = merged.groupby(['genre(s)', 'gender'])['rating'].mean().unstack()

# Find the absolute difference in average ratings
avg_ratings_genre['difference in average ratings'] = (avg_ratings_genre['M'] - avg_ratings_genre['F']).abs()
df_genre_gender = avg_ratings_genre.reset_index()[['genre(s)', 'difference in average ratings']]

# Sort the dataframe by the difference in average ratings in descending order
df_genre_gender = df_genre_gender.sort_values(by='difference in average ratings', ascending=False)

df_genre_gender

############# SOLUTION End ###############

| gender | genre(s) | difference in average ratings |
| --- | --- | --- |
| 12 | Action and Adventure and Drama and Romance | 1.036577 |
| 52 | Action and Western | 0.762913 |
| 28 | Action and Crime and Romance | 0.727273 |
| 146 | Sci-Fi and Thriller | 0.625616 |
| 117 | Drama and Mystery and Sci-Fi and Thriller | 0.612732 |
| ... | ... | ... |
| 80 | Children's and Drama and Fantasy and Sci-Fi | 0.008095 |
| 113 | Drama and Fantasy and Thriller | 0.006410 |
| 134 | Horror | 0.002144 |
| 33 | Action and Drama and Romance | 0.001981 |


## *Concepts required to complete this task*

*   Reading CSV files using Pandas
*   Method Chaining
*   Merging and Concatenation
*   (Boolean) Indexing
*   Data Manipulation Using Pandas
*   Replacing
*   Applying
*   Aggregating

### Rubric

- +8 points for correctness of `df_movie_gender` using proper logic
- +8 points for correctness of `df_genre_gender` using proper logic
- +3 points for conciseness
- +1 point for proper comments and variable names

# Part III: COVID-19 Dataset: Planning Average Joe's Vacation (30 points)

Your rich and adventurous uncle named **Average Joe** does not believe in Coronavirus. He thinks COVID-19 is a hoax. He's fed up with just sitting at home, and would love to travel. He wants to travel to a country that has consistently been less stringent in terms of its rules and policies. He has been nagging you to help him find such a country for 30 *not-so-magical* points in this assignment. You have no choice but to help him. :-(

In this part you will do that by manipulating the **Coronavirus Pandemic (COVID-19) Dataset**. Before moving on to the prompts in this section, please read the **README** file provided with the dataset. Also visually inspect the different files in the dataset along with the fields/columns in each file.

## A. Least Stringent Nations (15 points)

In this part, you will use the **Government Response Stringency Index** to figure out the least stringent nations. You would first compute the average stringency index for each country by month, and then you would `quantize` that.

*Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set*.

In this task, you will quantize these average stringency indices into three groups (less_stringent, somewhat_stringent, extremely_stringent) based on the following rules:

  - less_stringent: average stringency index <= 40
  - somewhat_stringent: average stringency index > 40 but <= 70
  - extremely_stringent: average stringency index > 70
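
The three rules above can be sketched with `pd.cut()`. This sketch assumes that the index lies between 0 and 100; the sample values are chosen only to probe the boundaries at 40 and 70.

```python
import pandas as pd

# Made-up values placed on and just above the two boundaries.
values = pd.Series([0.0, 40.0, 40.01, 70.0, 70.01, 100.0])
labels = ["less_stringent", "somewhat_stringent", "extremely_stringent"]

# By default the bins are closed on the right: (0, 40], (40, 70], (70, 100].
# include_lowest=True also closes the first bin on the left, so 0 is kept.
categories = pd.cut(
    values, bins=[0, 40, 70, 100], labels=labels, include_lowest=True
)

print(categories.tolist())
```

Note that passing `right=False` would instead produce the bins [0, 40), [40, 70), [70, 100), which puts a value of exactly 40 into `somewhat_stringent` and leaves 100 uncategorized.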

Once you have grouped, aggregated, and quantized these values, we would like you to filter/remove all countries which have either *ever* been *extremely_stringent* or *ever* been *somewhat_stringent*, and provide a Series/list of the remaining countries. In other words, we want countries that have always been *less_stringent*.
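
One loop-free way to express "always *less_stringent*" is `groupby().filter()`, which keeps or drops whole groups based on a condition. The sketch below uses made-up countries `X`, `Y`, and `Z` and assumes a table with one row per (country, month) that already carries a category column.

```python
import pandas as pd

# Made-up monthly categories for three hypothetical countries.
monthly = pd.DataFrame(
    {
        "location": ["X", "X", "Y", "Y", "Z"],
        "month": [1, 2, 1, 2, 1],
        "stringency_category": [
            "less_stringent",
            "less_stringent",
            "less_stringent",
            "somewhat_stringent",
            "extremely_stringent",
        ],
    }
)

# Keep a country only if every one of its months is less_stringent.
always_less = monthly.groupby("location").filter(
    lambda group: (group["stringency_category"] == "less_stringent").all()
)

countries = always_less["location"].drop_duplicates().tolist()
print(countries)
```

Country `Y` is dropped even though it was less stringent in one month, because a single stricter month is enough to disqualify it.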

These are the countries your uncle Joe will choose from.

Your solution should not be more than 20 lines of code, and should use no loops at all.

### Quick note:

Since some countries in the dataset do not have complete or any data in the `stringency_index` column, I have removed the rows where the `stringency_index` is missing via the `notna()` function. This is done to ensure that the average stringency index is not affected by the missing values.

However, this leads us to a scenario where some countries do not have data for stringency for some months of the year. For the purposes of this assignment, I will consider countries that are less stringent for the months where they have data, even if they do not have data for all months of the year (for example: Vanuatu has stringency data for just 3 months of the year, and they have been less stringent for those months, so I will consider Vanuatu as a less stringent country).

In [109]:
# Write your solution here

############# SOLUTION ###############
# Load the dataset
owid_data = pd.read_csv("coronavirus_pandemic/owid-covid-data.csv")

# Remove rows where stringency index is null (copy so later column assignments do not trigger SettingWithCopyWarning)
owid_data = owid_data[owid_data['stringency_index'].notna()].copy()

# Convert the date to a datetime object and extract the month
owid_data['date'] = pd.to_datetime(owid_data['date'])
owid_data['month'] = owid_data['date'].dt.month

# Compute the average stringency index by country and month
avg_stringency = owid_data.groupby(['iso_code', 'location', 'month'])['stringency_index'].mean().reset_index()

# Quantize the stringency index into right-closed bins that match the rules:
# [0, 40] -> less_stringent, (40, 70] -> somewhat_stringent, (70, 100] -> extremely_stringent
values = ['less_stringent', 'somewhat_stringent', 'extremely_stringent']

avg_stringency['stringency_category'] = pd.cut(avg_stringency['stringency_index'],
                                               bins=[0, 40, 70, 100],
                                               labels=values,
                                               include_lowest=True)

# Filter out countries that have ever been 'extremely_stringent' or 'somewhat_stringent'
non_stringent_countries = avg_stringency[~avg_stringency['iso_code'].isin(
    avg_stringency[avg_stringency['stringency_category'].isin(['extremely_stringent', 'somewhat_stringent'])]['iso_code'])]

non_stringent_countries = non_stringent_countries[['iso_code', 'location']].drop_duplicates()

print("Less stringent countries:")
non_stringent_countries

############# SOLUTION END ############

Less stringent countries:


| | iso_code | location |
|---|---|---|
| 108 | BDI | Burundi |
| 215 | BLR | Belarus |
| 672 | FRO | Faeroe Islands |
| 758 | GRL | Greenland |
| 1356 | NIC | Nicaragua |
| 1895 | TWN | Taiwan |
| 2003 | VUT | Vanuatu |


### *Concepts required to complete this task*

*   Data Manipulation using Pandas
*   Method Chaining
*   Grouping
*   Applying
*   Aggregating


### Rubric

- +10 points for correctness using Pandas library
- +3 points for conciseness
- +2 points for proper comments and variable names

## B. Average Joe loves gatherings (10 points)

**Average Joe** would also like to attend as many gatherings as possible wherever he travels, and would like to know the names of the countries that have had **no restrictions on gatherings** for the **greatest number of days** in the data.

As part of this problem, your solution should result in a DataFrame that consists of the following columns:
*   **country_name**: this is the name of the country
*   **days_with_no_gathering_restrictions**: this is the number of days that the given country has had no restrictions on gatherings.

The DataFrame should be sorted in descending order by the **days_with_no_gathering_restrictions** column.

Your solution should be no more than 3 statements (this includes reading the file, manipulation, and sorting).
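As a rough illustration of the counting step, the sketch below works on a small hypothetical wide table. It assumes one row per country, two identifier columns first, and one column per day where `0` means no restriction on gatherings. The column names and values here are invented, so check the README for the real layout.

```python
import pandas as pd

# Hypothetical wide table: identifier columns followed by one column per day
toy_c4 = pd.DataFrame({
    'country_code': ['AAA', 'BBB', 'CCC'],
    'country_name': ['Country A', 'Country B', 'Country C'],
    'day_1': [0, 0, 2],
    'day_2': [0, 3, 2],
    'day_3': [0, 0, 4],
})

# Everything after the two identifier columns is treated as a daily value
day_columns = toy_c4.columns[2:]

# Count the zeros in each row, keep the two required columns, and sort
toy_ranking = (
    toy_c4.assign(days_with_no_gathering_restrictions=(toy_c4[day_columns] == 0).sum(axis=1))
          [['country_name', 'days_with_no_gathering_restrictions']]
          .sort_values(by='days_with_no_gathering_restrictions', ascending=False)
)

print(toy_ranking)
```

Comparing with `== 0` yields `False` for missing values, so days without data are not counted as unrestricted in this sketch.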



In [110]:
# Write your solution here

############# SOLUTION ###############
# 1. Read the CSV file
gathering_restrictions_df = pd.read_csv('coronavirus_pandemic/c4.csv')

# 2. Compute the days with no gathering restrictions for each country
gathering_restrictions_df['days_with_no_gathering_restrictions'] = (gathering_restrictions_df.iloc[:, 2:] == 0).sum(axis=1)

# 3. Extract required columns and sort the dataframe
sorted_gathering_restrictions_df = gathering_restrictions_df[['country_name', 'days_with_no_gathering_restrictions']].sort_values(by='days_with_no_gathering_restrictions', ascending=False)
sorted_gathering_restrictions_df


############# SOLUTION END ############

| | country_name | days_with_no_gathering_restrictions |
|---|---|---|
| 146 | Solomon Islands | 398 |
| 170 | Taiwan | 398 |
| 181 | Yemen | 398 |
| 103 | Macao | 398 |
| 91 | Kiribati | 398 |
| ... | ... | ... |
| 30 | Switzerland | 58 |
| 80 | Iraq | 55 |
| 32 | China | 21 |
| 145 | Singapore | 0 |


### *Concepts required to complete this task*

*   Data Manipulation Using Pandas


### Rubric

- +5 points for correctness using Pandas library
- +3 points for conciseness
- +2 points for proper comments and variable names

## C. Which country should he visit? (5 points)

Based on the countries you obtained in Part A and the ranking you computed in Part B, which country should be at the top of Uncle Joe's list of priorities? This is the first country from the top of the Part B ranking that also appears in the list of countries from Part A.


Write 1 line of code that takes the list from Part A and the DataFrame from Part B, and retrieves the top choice's `country_name` along with its value for `days_with_no_gathering_restrictions`.
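The selection can be sketched on toy inputs with `isin()` followed by `iloc[0]`. The candidate list and ranking below are hypothetical stand-ins for the Part A and Part B results.

```python
import pandas as pd

# Hypothetical stand-ins for the Part A list and the sorted Part B DataFrame
toy_candidates = ['Country B', 'Country C']
toy_ranking = pd.DataFrame({
    'country_name': ['Country A', 'Country B', 'Country C'],
    'days_with_no_gathering_restrictions': [300, 250, 100],
})

# Keep only ranked rows whose country is a candidate, then take the first one
toy_top_choice = toy_ranking[toy_ranking['country_name'].isin(toy_candidates)].iloc[0]

print(toy_top_choice['country_name'], toy_top_choice['days_with_no_gathering_restrictions'])
```

`Country A` ranks highest but is not a candidate, so the first match is `Country B`. If several candidates tie on the number of days, the row returned by `iloc[0]` depends on how the ranking ordered the tied rows.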

In [111]:
# Write your solution here

############# SOLUTION ###############
# Find the first country in the sorted gathering restrictions dataframe that is in the list of non-stringent countries
top_choice = sorted_gathering_restrictions_df[sorted_gathering_restrictions_df['country_name'].isin(non_stringent_countries['location'])].iloc[0]

print(f'Country: {top_choice["country_name"]} | Days with no gathering restrictions: {top_choice["days_with_no_gathering_restrictions"]}')

############# SOLUTION END ############

Country: Taiwan | Days with no gathering restrictions: 398


### *Concepts required to complete this task*

*   Data Manipulation Using Pandas

### Rubric

- +3 points for correctness
- +2 points for conciseness