# Project 2 

## Part 1: File Structure

Give your project a proper file structure.  There should be at least one (but maybe more!) folders.  Some possible folders:
- Data or dat
- Scripts or scr
- Documents or doc 
- Results
- Clean_Data

Think about what should live in your root: at a minimum, LICENCE, README, and .gitignore.

Where should your Citations live?  It's your call, but you need to include this somewhere that makes sense.

If your data is in a different folder than your scripts, you may need a relative path such as `pd.read_csv("../data/data.csv")`.

You will be using only relative paths, not absolute paths.  This means that anyone else that forks your project can run everything, without having to change the path.
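
As a minimal sketch of the relative-path idea, assuming a hypothetical layout where the notebook sits in a `scripts` folder next to a `data` folder, `pathlib` can build the same path without anything machine-specific. The folder and file names are placeholders, not the required structure.

```python
from pathlib import Path

# Hypothetical layout (adjust to your own folders):
#   project_root/scripts/analysis.ipynb   <- code runs from here
#   project_root/data/data.csv            <- file to load
csv_path = Path("..") / "data" / "data.csv"

print(csv_path.as_posix())     # ../data/data.csv
print(csv_path.is_absolute())  # False

# With pandas installed, the same path can be passed straight to read_csv:
# df = pd.read_csv(csv_path)
```

Because the path starts from wherever the script runs rather than from a drive or user folder, it should keep working after someone forks or clones the project.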

Don't merge in your branch until I've released grades (for THIS part, not last week's).

Take a screen grab of the local version of your project, since the data is listed in the .gitignore file and will not appear in the remote repository.

Do you regret naming your repo "BABI 4005 Baby Project"?  Try renaming it!  Rename your analysis file to something more descriptive, if you would like.  

## Part 2: Joins
Find another dataset that will join with your original data.  This can be anything that you would like, as long as there is a key in common.

If you cannot find any real data, you may wish to generate some fake data; just cite that.  This is a perfect task for GenAI.

Make a new script that opens up your clean data from last week, and joins the new dataset, to make one big one.  In a markdown cell, explain why you chose the type of merge you chose, and what type of NA values are created (or not!)
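
To help with choosing a merge type, here is a small sketch on toy data (the frames, the `key` column, and all values are made up) that compares left, inner, and outer joins and counts the NA values each one creates.

```python
import pandas as pd

# Toy frames with made-up values; "key" stands in for the shared column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# indicator=True adds a _merge column saying where each row came from.
left_join = left.merge(right, how="left", on="key", indicator=True)
inner_join = left.merge(right, how="inner", on="key")
outer_join = left.merge(right, how="outer", on="key")

print(len(left_join), len(inner_join), len(outer_join))  # 3 2 4
print(int(left_join["y"].isna().sum()))    # 1 (key "c" has no match on the right)
print(int(outer_join.isna().sum().sum()))  # 2 (one NaN in x, one in y)
```

A left join keeps every row of the left table and fills unmatched right-hand columns with NaN, an inner join drops unmatched rows, and an outer join keeps everything from both sides.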

Save the new  dataset.
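
One possible way to save the result is sketched below. The frame and the file name `merged_data.csv` are hypothetical, and a temporary directory is used only so the sketch runs anywhere; in the project itself the target would be a relative path into one of your folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the merged dataset.
merged = pd.DataFrame({"key": ["a", "b"], "x": [1, 2], "y": [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "merged_data.csv"  # hypothetical file name
    # index=False keeps the row index out of the file, so reloading it
    # does not produce an extra "Unnamed: 0" column.
    merged.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))  # ['key', 'x', 'y']
```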

In [1]:
# Import the pandas, matplotlib, and seaborn libraries
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 


#Display plots directly below the code cell 
%matplotlib inline

In [2]:
# Load the original dataset and keep only the columns needed for the merge
df = pd.read_csv('social_media_vs_productivity.csv')
df1 = df[['age', 'gender', 'job_type', 'social_platform_preference',
          'work_hours_per_day', 'perceived_productivity_score',
          'actual_productivity_score']]
# Load the recommendations table
df2 = pd.read_csv("social_media_recommendations.csv")

# View the first 5 rows of df1
df1.head()

Unnamed: 0,age,gender,job_type,social_platform_preference,work_hours_per_day,perceived_productivity_score,actual_productivity_score
0,56,Male,Unemployed,Facebook,6.753558,8.040464,7.291555
1,46,Male,Health,Twitter,9.169296,5.063368,5.165093
2,32,Male,Finance,Twitter,7.910952,3.861762,3.474053
3,60,Female,Unemployed,Facebook,6.355027,2.916331,1.774869
4,25,Male,IT,Telegram,6.214096,8.868753,


In [3]:
# View the first 5 rows of df2
df2.head()

Unnamed: 0,social_platform_preference,suggested_social_media_time,suggested_active_time,suggested_breaks_per_day,suggested_weekly_offline_hours,wellness_priority
0,TikTok,3.1,7.5,3,93.8,Low
1,Telegram,2.1,7.5,3,100.8,Medium
2,Instagram,2.9,7.5,3,95.2,Medium
3,Twitter,2.6,7.5,3,97.3,Medium
4,Facebook,2.7,7.5,3,96.6,Medium


In [6]:
# Merge df1 and df2 with a left join on social_platform_preference to create df3
df3 = pd.merge(df1, df2, how='left', on='social_platform_preference')

# View the first 5 rows of df3
df3.head()

Unnamed: 0,age,gender,job_type,social_platform_preference,work_hours_per_day,perceived_productivity_score,actual_productivity_score,suggested_social_media_time,suggested_active_time,suggested_breaks_per_day,suggested_weekly_offline_hours,wellness_priority
0,56,Male,Unemployed,Facebook,6.753558,8.040464,7.291555,2.7,7.5,3,96.6,Medium
1,46,Male,Health,Twitter,9.169296,5.063368,5.165093,2.6,7.5,3,97.3,Medium
2,32,Male,Finance,Twitter,7.910952,3.861762,3.474053,2.6,7.5,3,97.3,Medium
3,60,Female,Unemployed,Facebook,6.355027,2.916331,1.774869,2.7,7.5,3,96.6,Medium
4,25,Male,IT,Telegram,6.214096,8.868753,,2.1,7.5,3,100.8,Medium


## Merge Explanation

There are 2 datasets. 

df1 is a condensed version of df, which is the original social media vs productivity table. 

df2 is a table of social media health recommendations, one per social platform preference. This table was AI-generated.

The join I used was the left join with df1 in the left position and df2 in the right position. 

I chose a left join with df1 in the left position so that every row of df1 is kept and the table stays readable. df1 has no ID column; the combination of age, gender and social platform preference provides the unique context for each record. The left join attaches the matching df2 recommendation to each df1 record. This produces a table that gives context to the relationship between social media, productivity and health recommendations.
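
As a hedged sketch of how this choice could be checked, the toy example below uses the `validate` argument of `merge` to confirm that the right table has one row per platform, so the left join cannot add rows. The column names follow the notebook, but the frames `people` and `recs` and their values are illustrative stand-ins for df1 and df2.

```python
import pandas as pd

# Made-up rows; column names follow the notebook, values are illustrative.
people = pd.DataFrame({
    "age": [56, 46, 32],
    "social_platform_preference": ["Facebook", "Twitter", "Twitter"],
})
recs = pd.DataFrame({
    "social_platform_preference": ["Facebook", "Twitter"],
    "suggested_breaks_per_day": [3, 3],
})

# validate="many_to_one" raises pandas.errors.MergeError if the right table
# has duplicate keys, which would otherwise multiply rows on the left.
merged = people.merge(
    recs, how="left", on="social_platform_preference", validate="many_to_one"
)

print(len(merged) == len(people))  # True
```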

## NaN Value Explanation

In [8]:
# Count the null values in each column
df3.isnull().sum()

age                                  0
gender                               0
job_type                             0
social_platform_preference           0
work_hours_per_day                   0
perceived_productivity_score      1614
actual_productivity_score         2365
suggested_social_media_time          0
suggested_active_time                0
suggested_breaks_per_day             0
suggested_weekly_offline_hours       0
wellness_priority                    0
dtype: int64

No NaN values were created by this merge, because every social platform preference value in df1 has a corresponding recommendation record in df2. The NaN values in df3 come from the original table, where respondents did not provide a response.
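
One way this claim could be verified is sketched below on made-up rows (the frames `people` and `recs` stand in for df1 and df2, and the values are illustrative). It checks for unmatched keys and compares NaN counts before and after the merge.

```python
import pandas as pd

# Made-up rows; None marks a respondent who gave no productivity score.
people = pd.DataFrame({
    "social_platform_preference": ["Facebook", "Twitter", "Telegram"],
    "actual_productivity_score": [7.3, 5.2, None],
})
recs = pd.DataFrame({
    "social_platform_preference": ["Facebook", "Twitter", "Telegram"],
    "wellness_priority": ["Medium", "Medium", "Medium"],
})
merged = people.merge(recs, how="left", on="social_platform_preference")

# Keys present on the left but missing on the right would produce new NaNs.
unmatched = set(people["social_platform_preference"]) - set(
    recs["social_platform_preference"]
)
# NaNs in the columns that came from the right table are the ones the merge made.
new_nans = int(merged[["wellness_priority"]].isna().sum().sum())
# NaNs in the original columns should be unchanged by the merge.
same_as_before = bool(
    merged["actual_productivity_score"].isna().sum()
    == people["actual_productivity_score"].isna().sum()
)

print(unmatched, new_nans, same_as_before)  # set() 0 True
```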