# Data Analysis Template

1. Find and Load Data
2. Initial Exploration
3. Initial Questions
4. Clean Data / Transform Data
5. Context
6. Hypothesis Test
7. Regression

## Load Data

* What kind of data do you have?
  * Do you have enough data to do basic tests?
  * Is your data reliable / trustworthy?
    * Where did the data come from?
  * Is your data representative of what you're analyzing?
    * (e.g. the average height of professional athletes may not be
      representative of the average height of the general population)
* What format is your data in?
  * Common types that can be read by Pandas
    * CSV
    * Excel
    * JSON
      * Will require more effort defining loading parameters
  * SQL data can be loaded into a DataFrame after you connect to the database
    with another library (e.g. psycopg2, sqlalchemy)
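
As a minimal sketch of the formats listed above, the snippet below reads the same small table from CSV, JSON, and SQL sources. The in-memory strings, the `people` table, and the SQLite connection are stand-ins invented for illustration; in a real project you would pass file paths and your own database connection. Excel files follow the same pattern with `pd.read_excel`, which needs an extra engine installed and is left out here.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data standing in for real files
csv_text = "name,age\nAna,25\nBen,30\n"
json_text = '[{"name": "Ana", "age": 25}, {"name": "Ben", "age": 30}]'

# CSV: read_csv accepts a path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records maps directly onto rows
json_df = pd.read_json(io.StringIO(json_text))

# SQL: connect to the database first, then hand the connection to pandas
conn = sqlite3.connect(":memory:")
csv_df.to_sql("people", conn, index=False)
sql_df = pd.read_sql("SELECT name, age FROM people", conn)
conn.close()

print(csv_df.shape, json_df.shape, sql_df.shape)
```

Nested JSON usually needs more work than this flat example, which is why it is flagged above as requiring more effort.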

Import libraries

In [41]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

Test getting a file's parent directory from its path using os

In [44]:
database = '../users/amheard0311/data/train.csv'
parent = os.path.dirname(database)
parent

'../users/amheard0311/data'

Import data set

In [2]:
socialmedia_db = pd.read_csv('../data/train.csv')
socialmedia_db.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,991,992,993,994,995,996,997,998,999,1000
User_ID,1,2,3,4,5,6,7,8,9,10,...,991,992,993,994,995,996,997,998,999,1000
Age,25,30,22,28,33,21,27,24,29,31,...,27,32,24,29,26,33,22,35,28,27
Gender,Female,Male,Non-binary,Female,Male,Male,Female,Non-binary,Female,Male,...,Non-binary,Female,Male,Female,Male,Non-binary,Female,Male,Non-binary,Female
Platform,Instagram,Twitter,Facebook,Instagram,LinkedIn,Instagram,Twitter,Facebook,LinkedIn,Instagram,...,Facebook,Whatsapp,Telegram,Snapchat,Instagram,Twitter,Facebook,Whatsapp,Telegram,Snapchat
Daily_Usage_Time (minutes),120.0,90.0,60.0,200.0,45.0,150.0,85.0,110.0,55.0,170.0,...,50.0,105.0,75.0,95.0,150.0,85.0,70.0,110.0,60.0,120.0
Posts_Per_Day,3.0,5.0,2.0,8.0,1.0,4.0,3.0,6.0,2.0,5.0,...,1.0,4.0,3.0,2.0,5.0,4.0,1.0,3.0,2.0,4.0
Likes_Received_Per_Day,45.0,20.0,15.0,100.0,5.0,60.0,30.0,25.0,10.0,80.0,...,10.0,55.0,37.0,23.0,70.0,35.0,14.0,50.0,18.0,40.0
Comments_Received_Per_Day,10.0,25.0,5.0,30.0,2.0,15.0,10.0,12.0,3.0,20.0,...,4.0,25.0,16.0,10.0,25.0,18.0,6.0,25.0,8.0,18.0
Messages_Sent_Per_Day,12.0,30.0,20.0,50.0,10.0,25.0,18.0,22.0,8.0,35.0,...,10.0,25.0,22.0,28.0,30.0,18.0,10.0,25.0,18.0,22.0
Dominant_Emotion,Happiness,Anger,Neutral,Anxiety,Boredom,Happiness,Anger,Sadness,Neutral,Happiness,...,Boredom,Anger,Neutral,Sadness,Anxiety,Boredom,Neutral,Happiness,Anger,Neutral


## Initial Exploration

* Learn about your data
  * What types of information do you have?
  * How many missing / nan values are in the data?
  * What data types are the different columns?
  * Are there any noticeable patterns in the data?
* Useful tools
  * .describe()
  * .info()
  * Scatter matrix / Pairplot
  * .corr()
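
The question about missing values can be answered directly with `isna()`. The sketch below uses a small made-up frame (`toy_df` and its columns are hypothetical) rather than the social media data, so it can be run on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy_df = pd.DataFrame({
    "age": [25, 30, np.nan, 28],
    "platform": ["Instagram", None, "Twitter", "Facebook"],
})

# Count missing values per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Show the dtype pandas inferred for each column
print(toy_df.dtypes)
```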

shape

In [3]:
socialmedia_db.shape

(1001, 10)

describe()

In [4]:
socialmedia_db.describe()

| | Daily_Usage_Time (minutes) | Posts_Per_Day | Likes_Received_Per_Day | Comments_Received_Per_Day | Messages_Sent_Per_Day |
|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 95.95 | 3.321 | 39.898 | 15.611 | 22.56 |
| std | 38.850442 | 1.914582 | 26.393867 | 8.819493 | 8.516274 |
| min | 40.0 | 1.0 | 5.0 | 2.0 | 8.0 |
| 25% | 65.0 | 2.0 | 20.0 | 8.0 | 17.75 |
| 50% | 85.0 | 3.0 | 33.0 | 14.0 | 22.0 |
| 75% | 120.0 | 4.0 | 55.0 | 22.0 | 28.0 |
| max | 200.0 | 8.0 | 110.0 | 40.0 | 50.0 |


info()

In [5]:
socialmedia_db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User_ID                     1001 non-null   object 
 1   Age                         1001 non-null   object 
 2   Gender                      1000 non-null   object 
 3   Platform                    1000 non-null   object 
 4   Daily_Usage_Time (minutes)  1000 non-null   float64
 5   Posts_Per_Day               1000 non-null   float64
 6   Likes_Received_Per_Day      1000 non-null   float64
 7   Comments_Received_Per_Day   1000 non-null   float64
 8   Messages_Sent_Per_Day       1000 non-null   float64
 9   Dominant_Emotion            1000 non-null   object 
dtypes: float64(5), object(5)
memory usage: 78.3+ KB


In [30]:
socialmedia_db.corr(numeric_only=True)

| | Age | Daily_Usage_Time (minutes) | Posts_Per_Day | Likes_Received_Per_Day | Comments_Received_Per_Day | Messages_Sent_Per_Day |
|---|---|---|---|---|---|---|
| Age | 1.0 | 0.087297 | 0.028109 | 0.058818 | 0.098809 | 0.107531 |
| Daily_Usage_Time (minutes) | 0.087297 | 1.0 | 0.889205 | 0.94134 | 0.89692 | 0.916234 |
| Posts_Per_Day | 0.028109 | 0.889205 | 1.0 | 0.917814 | 0.917309 | 0.875708 |
| Likes_Received_Per_Day | 0.058818 | 0.94134 | 0.917814 | 1.0 | 0.931057 | 0.910046 |
| Comments_Received_Per_Day | 0.098809 | 0.89692 | 0.917309 | 0.931057 | 1.0 | 0.882783 |
| Messages_Sent_Per_Day | 0.107531 | 0.916234 | 0.875708 | 0.910046 | 0.882783 | 1.0 |


## Initial Question

* Initial Question
  * What types of answers do you want the data to provide?
  * Will **THIS** data set be able to inform that question?
* During your exploration of the data, did any patterns or quirks provoke any
  questions related to the initial question?

## Clean Data / Transform Data

### Cleaning Your Data

* Did you split your data into training and testing sets?
  * Why is it usually better but harder to split before cleaning?
    * Are your cleaning choices affecting the data?
      * e.g. if you impute values with the mean of a column, is the mean or
        median of that column now different?
  * If you split data before cleaning, how do you make sure that a consistent
    set of steps is applied to both the training and testing sets?
      * Hint: functions and classes will perform a consistent set of steps
* Are there any null values?
  * How many?
  * In which columns?
* Are there any columns with mostly missing values?
* Why are there missing values?
  * Are values missing at random?
  * Are values missing for a logical reason connected with the data or its collection?
* Are there any rows or columns with typos or incorrect data?
  * (e.g. "?", "0" as opposed to 0, "Don't know", ...)
* Are the data types of the columns what you expect?
  * (e.g. dates that are actually strings, numbers pandas interpreted as strings)
* Are there extreme outliers (extreme values) in your data?
  * How should you deal with these?
  * How do they relate to your question? Do you need to keep them?

Note: the data was already split into training and test sets.

Age and gender are occasionally swapped; the code below corrects this.

In [6]:
# Return True if the value can be converted to an integer
def can_convert_to_int(value):
    try:
        int(value)
        return True
    except (ValueError, TypeError):
        return False

# Flag rows where the Gender column holds an integer-like value (i.e. an age)
condition = socialmedia_db['Gender'].apply(can_convert_to_int)

# For the flagged rows, swap the values back into their proper columns
socialmedia_db.loc[condition, ['Age', 'Gender']] = socialmedia_db.loc[condition, ['Gender', 'Age']].values

# Save this to a new clean csv
socialmedia_db.to_csv('../data/clean_train_1.csv', index=False)
socialmedia_db = pd.read_csv('../data/clean_train_1.csv')

Looking for nulls

In [7]:
# One row contains nulls. With so few columns, every value matters, so drop any incomplete row.
socialmedia_db.dropna(inplace=True)

Checked for empty values that are not stored as nulls; all appears to be in order.

Validating data types

In [8]:
# Age is still read as an object, so cast it to int16 (saves memory if the dataset grows; an age will never exceed the int16 maximum of 32767)
socialmedia_db['Age'] = socialmedia_db['Age'].astype('int16')
socialmedia_db.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 0 to 1000
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User_ID                     1000 non-null   object 
 1   Age                         1000 non-null   int16  
 2   Gender                      1000 non-null   object 
 3   Platform                    1000 non-null   object 
 4   Daily_Usage_Time (minutes)  1000 non-null   float64
 5   Posts_Per_Day               1000 non-null   float64
 6   Likes_Received_Per_Day      1000 non-null   float64
 7   Comments_Received_Per_Day   1000 non-null   float64
 8   Messages_Sent_Per_Day       1000 non-null   float64
 9   Dominant_Emotion            1000 non-null   object 
dtypes: float64(5), int16(1), object(4)
memory usage: 80.1+ KB


Searching for outliers and their validity

In [9]:
for col in socialmedia_db:
    print(col, socialmedia_db[col].max())

User_ID 999
Age 35
Gender Non-binary
Platform Whatsapp
Daily_Usage_Time (minutes) 200.0
Posts_Per_Day 8.0
Likes_Received_Per_Day 110.0
Comments_Received_Per_Day 40.0
Messages_Sent_Per_Day 50.0
Dominant_Emotion Sadness
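
Printing the maximum of each column only covers the high end and gives no cutoff for what counts as extreme. One common rule of thumb, not used in the notebook above, flags values more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies it to a made-up series of usage times; the values are hypothetical and 600 is deliberately extreme.

```python
import pandas as pd

# Hypothetical usage times in minutes
usage = pd.Series([40, 65, 85, 90, 120, 150, 200, 600])

# Quartiles and interquartile range
q1 = usage.quantile(0.25)
q3 = usage.quantile(0.75)
iqr = q3 - q1

# Fences 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values outside the fences
outliers = usage[(usage < lower) | (usage > upper)]
print(outliers)
```

Whether a flagged value should be dropped still depends on the question being asked; the rule only points out candidates.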


### Transforming Your Data

* Do you need to scale your data for your analysis?
  * Hypothesis testing - usually no
    * For example, data should usually not be scaled before a t-test is
      performed because the variance in the data is part of the interpretation
  * Machine Learning - usually yes
* Do you have any categorical variables that need to be dummy encoded?
  * For example, converting the colors yellow, blue, and red into three columns
    (is_yellow, is_blue, is_red) that hold True or False (1 / 0)
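
Both transformations can be sketched on the color example. The `colors` frame and its `size` column are made up for illustration, and the scaling shown is plain standardization (subtract the mean, divide by the sample standard deviation), which is only one of several possible scaling choices.

```python
import pandas as pd

# Hypothetical data: one categorical and one numeric column
colors = pd.DataFrame({
    "color": ["yellow", "blue", "red", "blue"],
    "size": [10.0, 20.0, 30.0, 40.0],
})

# Dummy encoding: one indicator column per category, named is_<category>
encoded = pd.get_dummies(colors, columns=["color"], prefix="is")

# Standardize the numeric column to mean 0 and standard deviation 1
size_mean = encoded["size"].mean()
size_std = encoded["size"].std()
encoded["size_scaled"] = (encoded["size"] - size_mean) / size_std

print(encoded)
```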

Beginning of split for analysis and predictions through machine learning

#### Machine Learning Split

In [10]:
# User ID isn't needed for predictions; drop it in place
# (drop(..., inplace=True) returns None, so its result is not assigned)
socialmedia_db.drop("User_ID", axis=1, inplace=True)

# Get dummies for all categorical variables
socialmedia_ml_db = pd.get_dummies(socialmedia_db)
socialmedia_ml_db


Unnamed: 0,Age,Daily_Usage_Time (minutes),Posts_Per_Day,Likes_Received_Per_Day,Comments_Received_Per_Day,Messages_Sent_Per_Day,Gender_Female,Gender_Male,Gender_Non-binary,Platform_Facebook,...,Platform_Snapchat,Platform_Telegram,Platform_Twitter,Platform_Whatsapp,Dominant_Emotion_Anger,Dominant_Emotion_Anxiety,Dominant_Emotion_Boredom,Dominant_Emotion_Happiness,Dominant_Emotion_Neutral,Dominant_Emotion_Sadness
0,25,120.0,3.0,45.0,10.0,12.0,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,30,90.0,5.0,20.0,25.0,30.0,False,True,False,False,...,False,False,True,False,True,False,False,False,False,False
2,22,60.0,2.0,15.0,5.0,20.0,False,False,True,True,...,False,False,False,False,False,False,False,False,True,False
3,28,200.0,8.0,100.0,30.0,50.0,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,33,45.0,1.0,5.0,2.0,10.0,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,33,85.0,4.0,35.0,18.0,18.0,False,False,True,False,...,False,False,True,False,False,False,True,False,False,False
997,22,70.0,1.0,14.0,6.0,10.0,True,False,False,True,...,False,False,False,False,False,False,False,False,True,False
998,35,110.0,3.0,50.0,25.0,25.0,False,True,False,False,...,False,False,False,True,False,False,False,True,False,False
999,28,60.0,2.0,18.0,8.0,18.0,False,False,True,False,...,False,True,False,False,True,False,False,False,False,False


#### Analysis Split

In [17]:
socialmedia_db.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,991,992,993,994,995,996,997,998,999,1000
Age,25,30,22,28,33,21,27,24,29,31,...,27,32,24,29,26,33,22,35,28,27
Gender,Female,Male,Non-binary,Female,Male,Male,Female,Non-binary,Female,Male,...,Non-binary,Female,Male,Female,Male,Non-binary,Female,Male,Non-binary,Female
Platform,Instagram,Twitter,Facebook,Instagram,LinkedIn,Instagram,Twitter,Facebook,LinkedIn,Instagram,...,Facebook,Whatsapp,Telegram,Snapchat,Instagram,Twitter,Facebook,Whatsapp,Telegram,Snapchat
Daily_Usage_Time (minutes),120.0,90.0,60.0,200.0,45.0,150.0,85.0,110.0,55.0,170.0,...,50.0,105.0,75.0,95.0,150.0,85.0,70.0,110.0,60.0,120.0
Posts_Per_Day,3.0,5.0,2.0,8.0,1.0,4.0,3.0,6.0,2.0,5.0,...,1.0,4.0,3.0,2.0,5.0,4.0,1.0,3.0,2.0,4.0
Likes_Received_Per_Day,45.0,20.0,15.0,100.0,5.0,60.0,30.0,25.0,10.0,80.0,...,10.0,55.0,37.0,23.0,70.0,35.0,14.0,50.0,18.0,40.0
Comments_Received_Per_Day,10.0,25.0,5.0,30.0,2.0,15.0,10.0,12.0,3.0,20.0,...,4.0,25.0,16.0,10.0,25.0,18.0,6.0,25.0,8.0,18.0
Messages_Sent_Per_Day,12.0,30.0,20.0,50.0,10.0,25.0,18.0,22.0,8.0,35.0,...,10.0,25.0,22.0,28.0,30.0,18.0,10.0,25.0,18.0,22.0
Dominant_Emotion,Happiness,Anger,Neutral,Anxiety,Boredom,Happiness,Anger,Sadness,Neutral,Happiness,...,Boredom,Anger,Neutral,Sadness,Anxiety,Boredom,Neutral,Happiness,Anger,Neutral


In [16]:
unique_genders = socialmedia_db["Gender"].unique()
unique_platforms = socialmedia_db["Platform"].unique()
unique_emotions = socialmedia_db["Dominant_Emotion"].unique()
print(f"Gender: {unique_genders},\nPlatform: {unique_platforms},\nEmotions: {unique_emotions}")

Gender: ['Female' 'Male' 'Non-binary'],
Platform: ['Instagram' 'Twitter' 'Facebook' 'LinkedIn' 'Whatsapp' 'Telegram'
 'Snapchat'],
Emotions: ['Happiness' 'Anger' 'Neutral' 'Anxiety' 'Boredom' 'Sadness']


In [29]:
socialmedia_db.groupby("Gender")["Dominant_Emotion"].value_counts()

Gender      Dominant_Emotion
Female      Happiness           110
            Anxiety              60
            Anger                60
            Neutral              60
            Sadness              50
            Boredom              30
Male        Happiness            70
            Anxiety              60
            Sadness              60
            Boredom              60
            Anger                60
            Neutral              50
Non-binary  Neutral              90
            Anxiety              50
            Boredom              50
            Sadness              50
            Happiness            20
            Anger                10
Name: count, dtype: int64

## Context

* What information or visualizations does the audience need to understand what
  your analysis is?
* What information or visualizations does the audience need to understand why
  your analysis is important?

Emotional status based on gender

Emotional status by age, regardless of gender

Correlation matrix of age, gender, daily usage, emotional status

Correlation matrix of daily usage, posts, likes, messages, comments, emotional status

Correlations of each platform, emotional status

## Hypothesis Testing

* What is the hypothesis you want to test?
  * What is the null hypothesis you want to compare to?
* Do you have two logical groups or categories in the data that you want to compare?
* Would bootstrapping be useful to compare the two groups?
  * Is the data very small and/or oddly distributed?
  * Do you need to focus on how likely or unlikely it is for extreme values to
    occur in the data?
* How confident do you want to be in your conclusions?
  * What p-value threshold should you be under?
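
As a rough illustration of the bootstrapping question, the sketch below compares the means of two groups by resampling each group with replacement. The data are synthetic (drawn from normal distributions with made-up parameters) and do not come from the social media data set; `group_a` and `group_b` stand in for any two logical groups you want to compare.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic groups, e.g. daily usage minutes for two categories of users
group_a = rng.normal(loc=100, scale=30, size=200)
group_b = rng.normal(loc=90, scale=30, size=200)

observed_diff = group_a.mean() - group_b.mean()

# Resample each group with replacement and record the difference in means
n_boot = 5000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    sample_a = rng.choice(group_a, size=group_a.size, replace=True)
    sample_b = rng.choice(group_b, size=group_b.size, replace=True)
    boot_diffs[i] = sample_a.mean() - sample_b.mean()

# 95% percentile interval for the difference in means
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"observed difference: {observed_diff:.2f}")
print(f"95% bootstrap interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

If the interval excludes zero, the difference is unlikely to be due to resampling noise alone at that confidence level; choosing 95% here is an assumption, not a recommendation.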

Hypothesis

Null: Time spent on social media has no effect on emotion.

Alternative: The more time spent on social media, the more negative the effect on emotion.

In [13]:
# Code Here

## Regression and Predictions

### Model Building

* Does the data meet the assumptions of linear or logistic regression models?
  * For example
    * Linearity
    * Homoscedasticity
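
A residual plot is the usual way to check these assumptions. As a rough numeric stand-in for homoscedasticity, the sketch below fits a line to synthetic data and compares the spread of the residuals in the lower and upper halves of the predictor. The data, the halfway cut, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with constant noise; x is sorted so it can be cut in half
x = np.sort(rng.uniform(0, 10, size=300))
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=300)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare residual spread for low and high values of x
low_spread = residuals[:150].std()
high_spread = residuals[150:].std()
spread_ratio = high_spread / low_spread

print(f"residual spread ratio (high x / low x): {spread_ratio:.2f}")
```

A ratio near 1 is consistent with constant variance; a ratio far from 1 suggests the noise grows or shrinks with the predictor.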

In [14]:
# Code Here

### Model Evaluation

* How did the model perform?
  * Did you use tools like those below to evaluate your model?
    * R-squared (R²)
    * Confusion matrix
    * Mean squared error (MSE)
* Did you split your data into training and testing?
* Does your model perform equally well or much worse on your testing data?
  * Could your model be overfit?
* How will this model likely perform in the real world?
* What are your coefficients?
  * What do the coefficients tell us about which feature is most important?
  * Can we trust the coefficients if the data isn't scaled?
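
The train versus test comparison can be sketched with NumPy alone. The example below uses synthetic data with a known linear relationship, an assumed 80/20 split, and a hypothetical helper `evaluate`; it is not a model of the social media data. It computes MSE and R² on both splits so the two can be compared.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y depends linearly on x plus noise
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=200)

# Shuffle the row indices and split 80/20
idx = rng.permutation(200)
train_idx, test_idx = idx[:160], idx[160:]

# Fit slope and intercept on the training data only
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)


def evaluate(x_part, y_part):
    """Return (MSE, R squared) of the fitted line on the given data."""
    pred = slope * x_part + intercept
    ss_res = np.sum((y_part - pred) ** 2)
    ss_tot = np.sum((y_part - y_part.mean()) ** 2)
    return ss_res / y_part.size, 1 - ss_res / ss_tot


train_mse, train_r2 = evaluate(x[train_idx], y[train_idx])
test_mse, test_r2 = evaluate(x[test_idx], y[test_idx])

print(f"train: MSE={train_mse:.2f}, R2={train_r2:.3f}")
print(f"test:  MSE={test_mse:.2f}, R2={test_r2:.3f}")
```

A test score far below the training score is a sign of overfitting; similar scores suggest the model generalizes to data it has not seen.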

In [15]:
# Code Here