# Intro to Data Science - Homework 6 - Spring 2025 - Wilmington College
## Due Date: April 28, 2025

## Exercise 1: Group-Based Aggregation and Filtering  
Consider a dataset containing information about customer purchases across different store locations. The dataset is structured as follows:

In [2]:
import pandas as pd

data = {
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Store': ['East', 'West', 'East', 'North', 'West', 'North', 'East', 'West'],
    'Purchase_Amount': [250, 300, 150, 400, 220, 500, 100, 330]
}

df = pd.DataFrame(data)
df

   Customer_ID  Store  Purchase_Amount
0          101   East              250
1          102   West              300
2          103   East              150
3          104  North              400
4          105   West              220
5          106  North              500
6          107   East              100
7          108   West              330


### Tasks:
a) Group the dataset by 'Store' and calculate the average purchase amount for each store.  
b) Keep only the customers whose purchase amount is above the average for their store, using `transform` to compute the store average.  
c) Create a new column called `'Above_Avg'` that marks `True` if the customer’s purchase is above the store average, and `False` otherwise.

In [None]:
import pandas as pd

# dataset
data = {
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Store': ['East', 'West', 'East', 'North', 'West', 'North', 'East', 'West'],
    'Purchase_Amount': [250, 300, 150, 400, 220, 500, 100, 330]
}

df = pd.DataFrame(data)

# a) Group by 'Store' and calculate average purchase amount
store_avg = df.groupby('Store')['Purchase_Amount'].mean()

# b) Filter customers whose purchase amount is above the average for their store
df['Store_Avg'] = df.groupby('Store')['Purchase_Amount'].transform('mean')
filtered_df = df[df['Purchase_Amount'] > df['Store_Avg']]

# c) Create 'Above_Avg' column
df['Above_Avg'] = df['Purchase_Amount'] > df['Store_Avg']

# final dataframe
print(df)


   Customer_ID  Store  Purchase_Amount   Store_Avg  Above_Avg
0          101   East              250  166.666667       True
1          102   West              300  283.333333       True
2          103   East              150  166.666667      False
3          104  North              400  450.000000      False
4          105   West              220  283.333333      False
5          106  North              500  450.000000       True
6          107   East              100  166.666667      False
7          108   West              330  283.333333       True
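
The reason `transform('mean')` suits part (b) is that it returns one value per original row rather than one value per group, so the result lines up with `df` row for row. A minimal sketch of the difference on the same data follows; the names `per_group`, `per_row`, and `mapped` are illustrative and not part of the assignment.

```python
import pandas as pd

df = pd.DataFrame({
    'Store': ['East', 'West', 'East', 'North', 'West', 'North', 'East', 'West'],
    'Purchase_Amount': [250, 300, 150, 400, 220, 500, 100, 330],
})

grouped = df.groupby('Store')['Purchase_Amount']

# Aggregation: one value per store, indexed by store name
per_group = grouped.mean()

# Transformation: the store mean repeated on every row, indexed like df
per_row = grouped.transform('mean')

print(per_group.shape)  # (3,)
print(per_row.shape)    # (8,)

# Mapping the per-store means back onto the rows reproduces the transform result
mapped = df['Store'].map(per_group)
print((mapped - per_row).abs().max())
```

Because `per_row` shares its index with `df`, a comparison such as `df['Purchase_Amount'] > per_row` can be used directly as a boolean mask.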


## Exercise 2: Group Transformations

### Dataset: [Students Performance in Exams](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams)  
Download and load the dataset `StudentsPerformance.csv`.

a) Write a function `standardize_scores` that takes a group and standardizes the **math score** (z-score normalization) grouped by **parental level of education**.  

b) Apply your function and create a new column `math_score_z` in the dataframe.  

c) Compute the rank of students within each **test preparation course** group based on their **reading score**.

In [None]:

# dataset
df = pd.read_csv('StudentsPerformance.csv')

# a) Function to standardize math scores within each parental education group
def standardize_scores(group):
    return (group - group.mean()) / group.std()

# b) Apply the function and store the result in a new column
df['math_score_z'] = df.groupby('parental level of education')['math score'].transform(standardize_scores)

# c) Rank students within each test preparation course group by reading score (1 = highest)
df['reading_score_rank'] = df.groupby('test preparation course')['reading score'].rank(ascending=False)

# updated dataframe
print(df.head())



   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  \
0                    none          72             72             74   
1               completed          69             90             88   
2                    none          90             95             93   
3                    none          47             57             44   
4                    none          76             78             75   

   math_score_z  reading_score_rank  
0      0.174666               238.5  
1      0.130769                39.0  
2      1.336568                12.0  
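
Two details of this solution can be checked on a small example. `Series.std()` in pandas uses the sample standard deviation (`ddof=1`) by default, and `rank()` gives tied values the average of the positions they occupy, which is why a fractional rank such as 238.5 appears above. The sketch below uses a small made-up frame instead of the Kaggle file; the names `toy`, `group`, and `score` are hypothetical.

```python
import pandas as pd

# Made-up data: two groups, with a tie inside group 'b'
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'score': [60, 70, 80, 50, 50, 90],
})

def standardize_scores(group):
    # z-score using the pandas default sample standard deviation (ddof=1)
    return (group - group.mean()) / group.std()

toy['score_z'] = toy.groupby('group')['score'].transform(standardize_scores)

# Within each group the z-scores should have mean 0 and sample std 1
check = toy.groupby('group')['score_z'].agg(['mean', 'std'])
print(check.round(6))

# Tied scores share the average of their ranks (method='average' is the default)
toy['score_rank'] = toy.groupby('group')['score'].rank(ascending=False)
print(toy)
```

If distinct ranks are preferred for ties, `rank` also accepts `method='min'`, `'max'`, `'first'`, or `'dense'`.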


## Exercise 3: Multi-Index Pivot Table  
The dataset below records the number of items sold by different employees across quarters and product categories.

```python
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3'],
    'Category': ['Electronics', 'Electronics', 'Furniture', 'Furniture', 'Clothing', 'Clothing', 'Clothing', 'Electronics', 'Furniture'],
    'Units_Sold': [30, 20, 15, 25, 40, 10, 35, 22, 18]
}

df = pd.DataFrame(data)
df
```

### Tasks:
a) Create a pivot table that shows the total number of units sold for each employee across quarters.  
b) Create a pivot table with both 'Quarter' and 'Category' as the index, and employees as columns, summarizing total units sold.  
c) Add a `margins` row and column to the pivot table in (b) to show totals.  
d) Determine which employee sold the most total units using the pivot table.

In [None]:
# dataset
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3'],
    'Category': ['Electronics', 'Electronics', 'Furniture', 'Furniture', 'Clothing', 'Clothing', 'Clothing', 'Electronics', 'Furniture'],
    'Units_Sold': [30, 20, 15, 25, 40, 10, 35, 22, 18]
}

df = pd.DataFrame(data)

# a) Pivot table: total units sold per employee across quarters
pivot1 = pd.pivot_table(df, values='Units_Sold', index='Quarter', columns='Employee', aggfunc='sum')

# b) Pivot table: Quarter and Category as index, employees as columns
pivot2 = pd.pivot_table(df, values='Units_Sold', index=['Quarter', 'Category'], columns='Employee', aggfunc='sum')

# c) Add margins (totals) to pivot table
pivot2_margins = pd.pivot_table(df, values='Units_Sold', index=['Quarter', 'Category'], columns='Employee', aggfunc='sum', margins=True, margins_name='Total')

# d) Employee that sold the most total units, read from the pivot table in (a)
total_units_per_employee = pivot1.sum()
top_employee = total_units_per_employee.idxmax()

# outputs
print(pivot1)
print(pivot2)
print(pivot2_margins)
print(f"\nEmployee with most units sold: {top_employee}")





Employee  Alice  Bob  Charlie
Quarter                      
Q1           30   20       15
Q2           25   40       10
Q3           35   22       18
Employee             Alice   Bob  Charlie
Quarter Category                         
Q1      Electronics   30.0  20.0      NaN
        Furniture      NaN   NaN     15.0
Q2      Clothing       NaN  40.0     10.0
        Furniture     25.0   NaN      NaN
Q3      Clothing      35.0   NaN      NaN
        Electronics    NaN  22.0      NaN
        Furniture      NaN   NaN     18.0
Employee             Alice   Bob  Charlie  Total
Quarter Category                                
Q1      Electronics   30.0  20.0      NaN     50
        Furniture      NaN   NaN     15.0     15
Q2      Clothing       NaN  40.0     10.0     50
        Furniture     25.0   NaN      NaN     25
Q3      Clothing      35.0   NaN      NaN     35
        Electronics    NaN  22.0      NaN     22
        Furniture      NaN   NaN     18.0     18
Total                 90.0  82.
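
The `NaN` cells in the tables above mark employee and category combinations with no sales in that quarter. One possible variation, sketched here rather than required by the assignment, is to pass `fill_value=0` so that empty combinations show as zero; the per-employee totals can then be taken as column sums. The names `pivot_filled` and `totals` are illustrative.

```python
import pandas as pd

data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3'],
    'Category': ['Electronics', 'Electronics', 'Furniture', 'Furniture', 'Clothing',
                 'Clothing', 'Clothing', 'Electronics', 'Furniture'],
    'Units_Sold': [30, 20, 15, 25, 40, 10, 35, 22, 18],
}
df = pd.DataFrame(data)

# Same table as in (b), with missing combinations filled with 0
pivot_filled = pd.pivot_table(
    df,
    values='Units_Sold',
    index=['Quarter', 'Category'],
    columns='Employee',
    aggfunc='sum',
    fill_value=0,
)
print(pivot_filled)

# Column sums give each employee's total across all quarters and categories
totals = pivot_filled.sum()
print(totals)
print(totals.idxmax())  # Alice
```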

## Exercise 4: Pivot Tables

### Dataset: [Video Game Sales](https://www.kaggle.com/datasets/gregorut/videogamesales)  
Use the dataset `vgsales.csv`.

a) Create a pivot table that shows the **total global sales** for each **genre** by **platform**.  

b) Find the average **NA sales** for each genre across all platforms.  

c) Extend the pivot table to include row and column totals (margins).  

d) Identify the platform with the **highest total global sales** and the genre that dominates that platform.


In [None]:
# dataset
df = pd.read_csv('vgsales.csv')

# a) Pivot table: total global sales by genre and platform
pivot_genre_platform = pd.pivot_table(df, values='Global_Sales', index='Genre', columns='Platform', aggfunc='sum')

# b) Average NA sales for each genre
avg_na_sales_per_genre = df.groupby('Genre')['NA_Sales'].mean()

# c) Extend pivot table to include totals (margins)
pivot_genre_platform_margins = pd.pivot_table(df, values='Global_Sales', index='Genre', columns='Platform', aggfunc='sum', margins=True, margins_name='Total')

# d) Identify platform with highest total global sales and its dominating genre
total_sales_per_platform = df.groupby('Platform')['Global_Sales'].sum()
top_platform = total_sales_per_platform.idxmax()

# Filter the dataset to the top platform and find the genre with the highest sales
top_platform_data = df[df['Platform'] == top_platform]
top_genre_for_platform = top_platform_data.groupby('Genre')['Global_Sales'].sum().idxmax()

# outputs
print(pivot_genre_platform)
print(avg_na_sales_per_genre)
print(pivot_genre_platform_margins)
print(f"\nTop platform by total global sales: {top_platform}")
print(f"Dominating genre on {top_platform}: {top_genre_for_platform}")




Platform       2600   3DO    3DS    DC      DS     GB    GBA     GC    GEN  \
Genre                                                                        
Action        29.34   NaN  57.02  1.26  115.56   7.92  55.76  37.84   2.74   
Adventure      1.70  0.06   4.81  2.50   47.29  17.16  14.68   5.93   0.19   
Fighting       1.24   NaN  10.46  1.83    7.20    NaN   4.21  18.43   5.90   
Misc           3.58   NaN  10.48   NaN  137.76  13.35  36.25  16.73   0.03   
Platform      13.27   NaN  32.23  2.54   77.45  54.91  78.30  28.66  15.45   
Puzzle        14.68  0.02   5.57   NaN   84.29  47.47  12.92   4.70    NaN   
Racing         2.91   NaN  14.49  2.65   38.64   4.55  18.80  21.89   0.26   
Role-Playing    NaN   NaN  75.74  0.68  126.85  88.24  64.21  13.15   0.27   
Shooter       26.48   NaN   1.29  0.33    8.20   1.20   3.60  13.63   0.13   
Simulation     0.45  0.02  27.08  0.52  132.03   3.55   5.91   8.59    NaN   
Sports         3.43   NaN   6.20  3.66   31.83   9.05  16.41  25
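
Part (d) can also be answered from the pivot table itself instead of a second `groupby`. The sketch below shows the idea on a few made-up rows that only borrow the `Platform`, `Genre`, and `Global_Sales` column names from `vgsales.csv`; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented rows; only the column names match vgsales.csv
toy = pd.DataFrame({
    'Platform': ['PS2', 'PS2', 'PS2', 'Wii', 'Wii', 'DS'],
    'Genre': ['Action', 'Sports', 'Action', 'Sports', 'Misc', 'Puzzle'],
    'Global_Sales': [5.0, 7.5, 4.0, 9.0, 2.0, 3.0],
})

pivot = pd.pivot_table(toy, values='Global_Sales', index='Genre',
                       columns='Platform', aggfunc='sum')

# Column sums are the platform totals (NaN cells are skipped by default)
platform_totals = pivot.sum()
top_platform = platform_totals.idxmax()

# Within the top platform's column, the largest entry is the dominant genre
top_genre = pivot[top_platform].idxmax()

print(top_platform, top_genre)  # PS2 Action
```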

## Exercise 5: Cross-Tabulation with Normalization  
Consider the dataset of app users and their preferences:

```python
data = {
    'Age_Range': ['18-24', '25-34', '35-44', '18-24', '25-34', '35-44', '25-34', '18-24'],
    'Platform': ['iOS', 'Android', 'iOS', 'Android', 'iOS', 'Android', 'iOS', 'iOS'],
    'Preferred_Feature': ['Dark Mode', 'Notifications', 'Offline Access', 'Dark Mode', 'Dark Mode', 'Offline Access', 'Notifications', 'Notifications']
}

df = pd.DataFrame(data)
df
```

### Tasks:
a) Use `pd.crosstab` to count how many users fall into each `Age_Range` and `Platform`.  
b) Normalize the cross-tab by rows to show the percentage distribution of platforms within each age range.  
c) Create another cross-tab showing the count of users by `Age_Range` and `Preferred_Feature`.  
d) Which feature is most popular among users aged 25-34?

In [None]:
# dataset
data = {
    'Age_Range': ['18-24', '25-34', '35-44', '18-24', '25-34', '35-44', '25-34', '18-24'],
    'Platform': ['iOS', 'Android', 'iOS', 'Android', 'iOS', 'Android', 'iOS', 'iOS'],
    'Preferred_Feature': ['Dark Mode', 'Notifications', 'Offline Access', 'Dark Mode', 'Dark Mode', 'Offline Access', 'Notifications', 'Notifications']
}

df = pd.DataFrame(data)

# a) Cross-tab: count users by Age_Range and Platform
crosstab_age_platform = pd.crosstab(df['Age_Range'], df['Platform'])

# b) Normalize cross-tab by rows (percentage distribution within each age range)
crosstab_age_platform_norm = pd.crosstab(df['Age_Range'], df['Platform'], normalize='index') * 100

# c) Cross-tab: count users by Age_Range and Preferred_Feature
crosstab_age_feature = pd.crosstab(df['Age_Range'], df['Preferred_Feature'])

# d) most popular feature among users aged 25-34
most_popular_feature_25_34 = crosstab_age_feature.loc['25-34'].idxmax()

# outputs
print(crosstab_age_platform)
print(crosstab_age_platform_norm)
print(crosstab_age_feature)
print(f"\nMost popular feature among users aged 25-34: {most_popular_feature_25_34}")





Platform   Android  iOS
Age_Range              
18-24            1    2
25-34            1    2
35-44            1    1
Platform     Android        iOS
Age_Range                      
18-24      33.333333  66.666667
25-34      33.333333  66.666667
35-44      50.000000  50.000000
Preferred_Feature  Dark Mode  Notifications  Offline Access
Age_Range                                                  
18-24                      2              1               0
25-34                      1              2               0
35-44                      0              0               2

Most popular feature among users aged 25-34: Notifications
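
Part (b) uses `normalize='index'`, which makes each row sum to 1 (or 100 after multiplying). `pd.crosstab` also accepts `normalize='columns'` and `normalize='all'`; the sketch below compares the three on the same data. The names `by_row`, `by_col`, and `overall` are illustrative.

```python
import pandas as pd

data = {
    'Age_Range': ['18-24', '25-34', '35-44', '18-24', '25-34', '35-44', '25-34', '18-24'],
    'Platform': ['iOS', 'Android', 'iOS', 'Android', 'iOS', 'Android', 'iOS', 'iOS'],
}
df = pd.DataFrame(data)

# Share of each platform within an age range (rows sum to 1)
by_row = pd.crosstab(df['Age_Range'], df['Platform'], normalize='index')

# Share of each age range within a platform (columns sum to 1)
by_col = pd.crosstab(df['Age_Range'], df['Platform'], normalize='columns')

# Share of all users in each cell (the whole table sums to 1)
overall = pd.crosstab(df['Age_Range'], df['Platform'], normalize='all')

print(by_row.sum(axis=1))
print(by_col.sum(axis=0))
print(overall.to_numpy().sum())
```

Which option fits depends on the question being asked: "what share of 18-24 users are on iOS?" calls for row normalization, while "what share of iOS users are 18-24?" calls for column normalization.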


## Exercise 6: Cross-Tabulation

### Dataset: [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows)  
Use the dataset `netflix_titles.csv`.

a) Create a cross-tabulation of **type** (Movie or TV Show) by **rating** (e.g., TV-MA, PG, etc.).  

b) Compute the percentage distribution of each **rating** within **each content type**.  

c) Add total margins (rows/columns) to the crosstab.  

d) Identify the **most frequent rating** for TV Shows and compare it to the most frequent for Movies.

In [None]:
# Load the Netflix Titles dataset
df = pd.read_csv('netflix_titles.csv')

# a) Cross-tab: type (Movie or TV Show) by rating
crosstab_type_rating = pd.crosstab(df['type'], df['rating'])

# b) Percentage distribution of each rating within each content type
crosstab_type_rating_norm = pd.crosstab(df['type'], df['rating'], normalize='index') * 100

# c) Cross-tab with margins (totals)
crosstab_type_rating_margins = pd.crosstab(df['type'], df['rating'], margins=True, margins_name='Total')

# d) Identify most frequent rating for TV Shows and Movies
most_frequent_rating_tv = crosstab_type_rating.loc['TV Show'].idxmax()
most_frequent_rating_movie = crosstab_type_rating.loc['Movie'].idxmax()

# outputs
print(crosstab_type_rating)
print(crosstab_type_rating_norm)
print(crosstab_type_rating_margins)
print(f"\nMost frequent rating for TV Shows: {most_frequent_rating_tv}")
print(f"Most frequent rating for Movies: {most_frequent_rating_movie}")




rating   66 min  74 min  84 min   G  NC-17  NR   PG  PG-13    R  TV-14  TV-G  \
type                                                                           
Movie         1       1       1  41      3  75  287    490  797   1427   126   
TV Show       0       0       0   0      0   5    0      0    2    733    94   

rating   TV-MA  TV-PG  TV-Y  TV-Y7  TV-Y7-FV  UR  
type                                              
Movie     2062    540   131    139         5   3  
TV Show   1145    323   176    195         1   0  
rating     66 min    74 min    84 min         G     NC-17        NR        PG  \
type                                                                            
Movie    0.016316  0.016316  0.016316  0.668951  0.048948  1.223691  4.682656   
TV Show  0.000000  0.000000  0.000000  0.000000  0.000000  0.186986  0.000000   

rating      PG-13          R      TV-14      TV-G      TV-MA      TV-PG  \
type                                                                      
