# **Python for Data Science**

---

## Assignment 6

### Pandas I

---

# **README.md**

## Overview

This notebook includes exercises to analyze two datasets: **Netflix** and **Titanic**. Below are the exercises and their descriptions.

## Netflix Dataset Exercises

1. **Is there any missing rating?**
   - **Description:** Checking the 'rating' column for missing values and counting them.

2. **How many films in 2021 correspond to your country?**
   - **Description:** Counting the movies released in 2021 whose 'country' value matches a chosen country ("India" in this notebook).

3. **What's the number of movies in 2020 with full information?**
   - **Description:** Counting the movies released in 2020 that have no missing value in any column.

4. **Give me the year with more titles.**
   - **Description:** Counting the titles per release year and picking the year with the highest count.

5. **And what has been the average in terms of releases from 2010.**
   - **Description:** Calculating the average number of titles released per year from 2010 onwards.

## Titanic Dataset Exercises

1. **Calculate Gender-Based Survival Percentage**
   - **Description:** Calculating the survival percentage for each gender by dividing the number of survivors by the total passengers for each gender.

2. **Calculate Survival Percentage Grouped by Gender and Class**
   - **Description:** Calculating survival percentages for each combination of gender and passenger class.

---

## **Loading the Datasets**

Let's load the *Netflix* dataset.

In [68]:
import pandas as pd

path = 'netflix_titles.csv'

df = pd.read_csv(path)

In [4]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

Let's now load the *Titanic* dataset.

In [54]:
path = 'train_and_test2.csv'

titanic = pd.read_csv(path)

In [56]:
titanic.columns

Index(['Passengerid', 'Age', 'Fare', 'Sex', 'sibsp', 'zero', 'zero.1',
       'zero.2', 'zero.3', 'zero.4', 'zero.5', 'zero.6', 'Parch', 'zero.7',
       'zero.8', 'zero.9', 'zero.10', 'zero.11', 'zero.12', 'zero.13',
       'zero.14', 'Pclass', 'zero.15', 'zero.16', 'Embarked', 'zero.17',
       'zero.18', '2urvived'],
      dtype='object')
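
The column listing above contains many `zero*` columns and a misspelled target column, `2urvived`. A minimal sketch of one way to tidy such a frame follows. It runs on a made-up toy frame, assumes the `zero*` columns carry no useful information, and introduces the name `Survived` purely for illustration. The rest of this notebook keeps the original column names.

```python
import pandas as pd

# Toy frame that mimics a few of the column names shown above (values are made up).
toy = pd.DataFrame({
    "Passengerid": [1, 2, 3],
    "Sex": [0, 1, 1],
    "zero": [0, 0, 0],
    "zero.1": [0, 0, 0],
    "Pclass": [3, 1, 2],
    "2urvived": [0, 1, 1],
})

# Assumption: columns whose name starts with "zero" are placeholders.
placeholder_cols = [c for c in toy.columns if c.startswith("zero")]

# Drop the placeholders and give the target column a readable (hypothetical) name.
clean = toy.drop(columns=placeholder_cols).rename(columns={"2urvived": "Survived"})

print(list(clean.columns))
```

Before dropping anything in the real dataset, it is worth checking that each candidate column really holds a single constant value, for example with `titanic[col].nunique()`.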

---

# **Exercises**

## Pandas I

Home exercises for Netflix:

1. Is there any missing rating?
2. How many films in 2021 correspond to your country?
3. What's the number of movies in 2020 with full information?
4. Give me the year with more titles.
5. And what has been the average in terms of releases from 2010. 

And for Titanic:

1. Calculate Gender-Based Survival Percentage

2. Calculate Survival Percentage Grouped by Gender and Class

---

# **Netflix**

### Exercise 1: Is there any missing rating?

**Description:** In this exercise, we check if there are any missing values in the 'rating' column.


In [28]:
# Filtering rows where 'rating' is missing

missing_ratings_df = df[df['rating'].isnull()]  # Selecting rows where the 'rating' column is NaN.

# Displaying only rows with missing 'rating' to validate

if not missing_ratings_df.empty:
    print("Rows with missing ratings:")
    print(missing_ratings_df[['rating']])  # Displaying only the 'rating' column for clarity.
else:
    print("No rows with missing ratings found.")

Rows with missing ratings:
     rating
5989    NaN
6827    NaN
7312    NaN
7537    NaN


In [29]:
# Counting the number of missing values in the 'rating' column

missing_ratings = df['rating'].isnull().sum()

print(f"There are {missing_ratings} missing ratings.")

There are 4 missing ratings.
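
The same check can be widened to every column at once, which helps put the 'rating' gaps in context. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not taken from the Netflix file) to show `isna().sum()` per column.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps; not taken from netflix_titles.csv.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["TV-MA", np.nan, "PG", np.nan],
    "country": ["India", "Spain", np.nan, "India"],
})

# One missing-value count per column.
missing_per_column = toy.isna().sum()

# Names of the columns that have at least one gap.
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(columns_with_gaps)
```

`isna()` and `isnull()` are aliases in pandas, so either spelling gives the same result.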


---

### Exercise 2: How many films in 2021 correspond to your country?

**Description:** Here, we count how many films from your country (e.g., "India") were released in 2021.

In [39]:
country_name = "India"  # Specifying my country here.

# Filtering for films in 2021 that correspond to my country
films_in_2021_df = df[(df['release_year'] == 2021) & (df['country'] == country_name) & (df['type'] == 'Movie')]

# Displaying the filtered rows for validation
if not films_in_2021_df.empty:
    print(f"Films from {country_name} in 2021:")
    print(films_in_2021_df[['title', 'release_year', 'country', 'type']])  # Showing relevant columns for clarity.
else:
    print(f"No films from {country_name} in 2021.")

Films from India in 2021:
                              title  release_year country   type
190                      Thimmarusu          2021   India  Movie
551                 Haseen Dillruba          2021   India  Movie
735                         Sarbath          2021   India  Movie
850                        99 Songs          2021   India  Movie
871              Sardar Ka Grandson          2021   India  Movie
873                           Ahaan          2021   India  Movie
877                    Cinema Bandi          2021   India  Movie
903                         Nayattu          2021   India  Movie
909                       Milestone          2021   India  Movie
959                    The Disciple          2021   India  Movie
998            Searching For Sheela          2021   India  Movie
1023                Ajeeb Daastaans          2021   India  Movie
1037             Tuesdays & Fridays          2021   India  Movie
1087                          Roohi          2021   India  Movie

In [40]:
# Counting the number of films matching the criteria

films_in_2021_count = films_in_2021_df.shape[0]

print(f"There are {films_in_2021_count} films from {country_name} in 2021.")

There are 22 films from India in 2021.
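
The filter above keeps only rows whose 'country' is exactly equal to `country_name`. If co-productions are stored as comma-separated strings such as "India, United States" (an assumption about the data, worth checking in the file), those rows are excluded. A minimal sketch of a more inclusive match on made-up rows:

```python
import pandas as pd

# Made-up rows; the comma-separated country format is an assumption.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie"],
    "country": ["India", "India, United States", "India", None, "United States"],
    "release_year": [2021, 2021, 2021, 2021, 2021],
})

country_name = "India"
is_2021_movie = (toy["release_year"] == 2021) & (toy["type"] == "Movie")

# Exact match, as in the cell above.
exact = toy[is_2021_movie & (toy["country"] == country_name)]

# Inclusive match: split on commas, strip spaces, then test membership row by row.
country_lists = toy["country"].fillna("").str.split(",")
has_country = country_lists.apply(
    lambda names: country_name in [name.strip() for name in names]
)
inclusive = toy[is_2021_movie & has_country]

print(len(exact), len(inclusive))
```

Splitting and comparing whole names avoids the false positives a plain substring search can produce when one country name is contained in another.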


---

### Exercise 3: What's the number of movies in 2020 with full information?

**Description:** We count how many movies in 2020 have no missing data in any column.

In [43]:
# Filtering for movies in 2020 with no missing data

movies_2020_full_df = df[(df['release_year'] == 2020) & (df['type'] == 'Movie')].dropna()

# Displaying the filtered rows for validation

if not movies_2020_full_df.empty:
    print("Movies in 2020 with full information:")
    print(movies_2020_full_df[['title', 'release_year', 'country', 'rating', 'type']])  # Displaying relevant columns.
else:
    print("No movies in 2020 with full information.")

Movies in 2020 with full information:
                                   title  release_year        country rating  \
78                        Tughlaq Durbar          2020        Unknown  TV-14   
84                  Omo Ghetto: the Saga          2020        Nigeria  TV-MA   
103                       Shadow Parties          2020        Unknown  TV-MA   
119                       Here and There          2020        Unknown  TV-MA   
126                              Shikara          2020          India  TV-14   
...                                  ...           ...            ...    ...   
3044               Live Twice, Love Once          2020          Spain  TV-MA   
3046       All the Freckles in the World          2020         Mexico  TV-14   
3060                       Ghost Stories          2020          India  TV-MA   
7594  Norm of the North: Family Vacation          2020  United States  TV-Y7   
8099                         Straight Up          2020  United States  TV-MA   

 

In [47]:
# Counting the number of movies with full information

movies_2020_full_count = movies_2020_full_df.shape[0]

print(f"There are {movies_2020_full_count} movies in 2020 with full information.")

There are 458 movies in 2020 with full information.
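
The output above includes rows whose country is "Unknown", so `dropna()` treats that placeholder as a real value. Whether such rows count as having "full information" is a judgment call. The sketch below, on made-up rows, shows the plain count next to a stricter variant that also excludes the placeholder.

```python
import numpy as np
import pandas as pd

# Made-up rows; "Unknown" stands in for a filled-in missing country.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "director": ["X", np.nan, "Y", "Z"],
    "country": ["India", "Spain", "Unknown", "India"],
    "release_year": [2020, 2020, 2020, 2020],
})

movies_2020 = toy[(toy["release_year"] == 2020) & (toy["type"] == "Movie")]

# Rows without any NaN; equivalent to counting the rows of movies_2020.dropna().
full_count = int(movies_2020.notna().all(axis=1).sum())

# Stricter variant: also treat the "Unknown" placeholder as missing.
strict = movies_2020[movies_2020["country"] != "Unknown"].dropna()
strict_count = len(strict)

print(full_count, strict_count)
```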


---

### Exercise 4: Give me the year with more titles.

**Description:** We find the year that has the highest number of titles released.

In [46]:
# Getting the count of titles per year

titles_per_year = df['release_year'].value_counts()

# Displaying the title counts per year for validation

print("Titles per year:")

print(titles_per_year)

Titles per year:
release_year
2018    1147
2017    1032
2019    1030
2020     953
2016     902
        ... 
1959       1
1925       1
1961       1
1947       1
1966       1
Name: count, Length: 74, dtype: int64


In [45]:
# Finding the year with the most titles

year_with_most_titles = titles_per_year.idxmax()  # Getting the year with the highest count.

titles_in_that_year = titles_per_year.max()  # Getting the count for that year.

# Printing the result
print(f"The year with the most titles is {year_with_most_titles} with {titles_in_that_year} titles.")

The year with the most titles is 2018 with 1147 titles.
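
`idxmax()` returns a single label, so a tie for first place would be hidden. As a small illustration on made-up years, the sketch below also lists every year that reaches the maximum count.

```python
import pandas as pd

# Made-up release years.
years = pd.Series([2018, 2018, 2017, 2019, 2018, 2017], name="release_year")

counts = years.value_counts()   # titles per year, sorted by count (descending)
top_year = counts.idxmax()      # label of the largest count
top_count = counts.max()        # the largest count itself

# All years that share the maximum count (more than one entry means a tie).
tied_years = sorted(counts[counts == top_count].index.tolist())

print(top_year, top_count, tied_years)
```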


---

### Exercise 5: And what has been the average in terms of releases from 2010.

**Description:** We calculate the average number of titles released per year from 2010 to the most recent year.

In [49]:
# Filtering for titles from 2010 onwards

titles_from_2010_df = df[df['release_year'] >= 2010]

# Displaying the title counts per year for validation

print("Titles per year from 2010 onwards:")

print(titles_from_2010_df['release_year'].value_counts())

Titles per year from 2010 onwards:
release_year
2018    1147
2017    1032
2019    1030
2020     953
2016     902
2021     592
2015     560
2014     352
2013     288
2012     237
2010     194
2011     185
Name: count, dtype: int64


In [50]:
# Calculating the average number of releases per year

average_releases_2010_onwards = titles_from_2010_df['release_year'].value_counts().mean()

print(f"The average number of releases per year from 2010 onwards is {average_releases_2010_onwards:.2f}.")

The average number of releases per year from 2010 onwards is 622.67.
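
`value_counts().mean()` averages only over years that appear in the data. In the output above every year from 2010 to 2021 is present, so that is the same as averaging over calendar years. If a year had no titles at all, the two would differ. A minimal sketch on made-up years:

```python
import pandas as pd

# Made-up release years; note that 2011 has no titles.
toy = pd.DataFrame({"release_year": [2009, 2010, 2010, 2012, 2012, 2012, 2013]})

recent = toy[toy["release_year"] >= 2010]
per_year = recent["release_year"].value_counts()

# Average over the years that actually occur: (2 + 3 + 1) / 3.
avg_observed_years = per_year.mean()

# Average over every calendar year in the range, counting absent years as 0.
all_years = range(2010, int(recent["release_year"].max()) + 1)
avg_calendar_years = per_year.reindex(all_years, fill_value=0).mean()

print(avg_observed_years, avg_calendar_years)
```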


---

# **Titanic**

### Exercise 1: Calculate Gender-Based Survival Percentage.

**Description:** In this exercise, we calculate the survival percentage for each gender by dividing the number of survivors by the total passengers in each gender group.

In [69]:
# Grouping data by 'Sex' and calculating mean survival rate
gender_survival_percentage = titanic.groupby('Sex')['2urvived'].mean() * 100  # Multiplying by 100 for percentage.

# Displaying the survival percentages for validation
print("Gender-Based Survival Percentage:")
print(gender_survival_percentage)

Gender-Based Survival Percentage:
Sex
0    12.930012
1    50.000000
Name: 2urvived, dtype: float64


In [70]:
# Printing the results
print(f"Male Survival Percentage: {gender_survival_percentage[0]:.2f}%")    # Assuming 0 for Male.
print(f"Female Survival Percentage: {gender_survival_percentage[1]:.2f}%")  # Assuming 1 for Female.

Male Survival Percentage: 12.93%
Female Survival Percentage: 50.00%
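
Because '2urvived' holds 0/1 values, the group mean equals survivors divided by total passengers. The sketch below checks that equivalence on made-up rows and attaches readable labels, under the same assumption as above that 0 means male and 1 means female.

```python
import pandas as pd

# Made-up passengers; the 0 = Male, 1 = Female coding is an assumption.
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1],
    "2urvived": [0, 0, 0, 1, 1, 0],
})

# Explicit division: survivors / total passengers per group.
survivors = toy.groupby("Sex")["2urvived"].sum()
totals = toy.groupby("Sex")["2urvived"].count()
manual_pct = survivors / totals * 100

# Same result via the mean, grouped on readable labels instead of codes.
labels = {0: "Male", 1: "Female"}
labelled_pct = toy.groupby(toy["Sex"].map(labels))["2urvived"].mean() * 100

print(manual_pct)
print(labelled_pct)
```

Grouping on the mapped labels removes the need to remember which integer position belongs to which gender when printing.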


---

### Exercise 2: Calculate Survival Percentage Grouped by Gender and Class.

**Description:** Here, we calculate the survival percentage for each combination of gender and class.

In [71]:
# Grouping data by 'Sex' and 'Pclass' and calculating mean survival rate
gender_class_survival_percentage = titanic.groupby(['Sex', 'Pclass'])['2urvived'].mean() * 100  # Multiplying by 100 for percentage.

# Displaying the grouped survival percentages for validation
print("Survival Percentage Grouped by Gender and Class:")
print(gender_class_survival_percentage)

Survival Percentage Grouped by Gender and Class:
Sex  Pclass
0    1         25.139665
     2          9.941520
     3          9.533469
1    1         63.194444
     2         66.037736
     3         33.333333
Name: 2urvived, dtype: float64


In [72]:
# Printing the results
for (sex, pclass), percentage in gender_class_survival_percentage.items():
    gender = "Male" if sex == 0 else "Female"   # Assuming 0 for Male and 1 for Female.
    print(f"{gender} in Class {pclass}: {percentage:.2f}% survival rate")

Male in Class 1: 25.14% survival rate
Male in Class 2: 9.94% survival rate
Male in Class 3: 9.53% survival rate
Female in Class 1: 63.19% survival rate
Female in Class 2: 66.04% survival rate
Female in Class 3: 33.33% survival rate
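
The grouped result is a Series with a two-level index. For side-by-side comparison it can be reshaped into a small table with one row per gender and one column per class. A minimal sketch on made-up rows, again assuming 0 means male and 1 means female:

```python
import pandas as pd

# Made-up passengers; values are illustrative only.
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "2urvived": [1, 0, 0, 0, 1, 1, 1, 0],
})

rates = toy.groupby(["Sex", "Pclass"])["2urvived"].mean() * 100

# Move the 'Pclass' level into columns and label the rows.
table = rates.unstack("Pclass").rename(index={0: "Male", 1: "Female"})

print(table.round(2))
```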


---