## Introduction to Data Analysis and Visualisation


In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

**Reading a CSV file using the pandas library**

In [None]:
data = pd.read_csv('pseudo_facebook.csv')

**Checking first and last five rows**
* `.head()`: Observing the first 5 rows 

In [None]:
data.head()

<font color=blue> <b> Practice 1 </b> </font>

In [None]:
# How to retrieve the last 5 rows in dataset




**Checking the total rows and columns**

In [None]:
print(data.shape)

## Data Cleaning
* Remove null and unnecessary data



**Checking the type of variables**
* `.dtypes` provides the type of variables for each column
* `.info()` reports missing values (through the non-null counts), the data types, the number of rows, and the memory usage

In [None]:
# method 1
data.dtypes

In [None]:
# method 2
data.info()

**Checking column names**
* `data.columns.unique()` gives the unique column names. Check whether there are unnecessary spaces in the column names and replace them accordingly.

In [None]:
data.columns.unique()
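
If the check above turns up stray spaces, the names can be cleaned in one step. The sketch below uses a small hypothetical frame (its column names are invented for illustration and are not taken from `pseudo_facebook.csv`).

```python
import pandas as pd

# Hypothetical frame whose column names carry stray spaces.
df = pd.DataFrame({" userid": [1, 2], "age ": [14, 15], "friend count": [0, 3]})

# Strip leading/trailing whitespace, then replace inner spaces with underscores.
df.columns = df.columns.str.strip().str.replace(" ", "_")

print(list(df.columns))
```

Assigning back to `df.columns` renames the columns in place without touching the data.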

In [None]:
# convert gender to categorical variable
data['gender'] = data['gender'].astype('category')

<font color=blue> <b> Practice 2 </b> </font>

In [None]:
# Convert variable ‘age’ into float64 




In [None]:
data.info()

**Checking of null data**

In [None]:
data.isnull().sum()

**Removal of null data**

In [None]:
data.dropna(how='any', inplace=True)

Parameter `how`: {'any', 'all'}, default 'any'
* 'any': If any NA values are present, drop that row or column.
* 'all': If all values are NA, drop that row or column.
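
The difference between the two settings is easiest to see on a tiny example. This is a minimal sketch on hypothetical data: one complete row, one row that is entirely missing, and one row that is partly missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is all NA, row 2 is partly NA.
df = pd.DataFrame({
    "age": [14, np.nan, np.nan],
    "gender": ["male", None, "female"],
})

dropped_any = df.dropna(how="any")  # keeps only rows with no NA at all
dropped_all = df.dropna(how="all")  # drops only rows where every value is NA

print(len(dropped_any), len(dropped_all))
```

With `how='any'` only the complete row survives, while `how='all'` also keeps the partly missing row.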

In [None]:
data.isnull().sum()

In [None]:
data.notnull().sum()

**Removal of unnecessary columns**
* Since age is given, the information in `dob_day`, `dob_year` and `dob_month` is unnecessary. The `userid` is not needed either.

In [None]:
data1 = data.drop(['userid','dob_day','dob_month','dob_year'],axis = 1) 

* `axis = 1` means the listed labels refer to columns, so the whole columns are dropped
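
As a side note, pandas also accepts a `columns=` keyword that expresses the same column drop without an `axis` argument. The sketch below compares the two spellings on a hypothetical three-column frame.

```python
import pandas as pd

# Hypothetical frame with a subset of the column names used above.
df = pd.DataFrame({"userid": [1, 2], "age": [14, 15], "dob_year": [1999, 1998]})

by_axis = df.drop(["userid", "dob_year"], axis=1)
by_keyword = df.drop(columns=["userid", "dob_year"])

print(by_axis.equals(by_keyword))  # both spellings give the same result
print(list(df.columns))            # the original frame is left unchanged
```

Because `drop` returns a new DataFrame by default, the result has to be assigned to a name, as the notebook does with `data1`.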

In [None]:
data1.head()

<font color=blue> <b> Practice 3 </b> </font>

In [None]:
# drop first row of the dataset


# drop first 10 rows of the dataset
 

# drop rows 20 and 22 of the dataset


# perform head count


<font color=blue> <b> Practice 4 </b> </font>

In [None]:
# drop first 10 rows of the dataset



**Summary of data**
* `.describe()`:
* For variables that are float64 or int64, it provides statistical summaries of the variables, including `count`, `mean`, `std`, `min`, `25%`, `50%`, `75%` and `max`
* For categorical variables it will provide information like `count`, `unique`,`top` and `freq`

In [None]:
data1.describe()

In [None]:
data1['gender'].describe()

* `count`: total no. of data points for this categorical variable  
* `unique`: no. of different levels for this categorical variable (e.g. 2 for male & female)  
* `top`: level with the most occurrences  
* `freq`: no. of occurrences of the `top` level
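
The four fields above can be checked by hand on a short series. This sketch uses five hypothetical gender values rather than the real dataset.

```python
import pandas as pd

# Hypothetical categorical series: 3 x 'male', 2 x 'female'.
gender = pd.Series(["male", "female", "male", "male", "female"], dtype="category")

summary = gender.describe()
print(summary)
```

Here `count` is 5, `unique` is 2, `top` is `'male'` and `freq` is 3, matching the definitions in the list.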

In [None]:
data1['gender'].value_counts()

## Data Exploration and Manipulation
* Use of conditional filter
* Adding columns to existing DataFrame
* Joining dataframes
* Groupby


**Conditional filter** : filter by age
* Teenagers: below 18
* Young Adults: 18 to 25
* Adults: 26 to 35
* Middle Aged Adults: 36 to 50
* Older Adults: 51 to 60
* Senior citizens: above 60

In [None]:
# creating filters conditions
teenager = (data1['age'] <18)
young_adult = (data1['age']>= 18) & (data1['age']<= 25)
adult = (data1['age']>= 26) & (data1['age']<= 35)
middle_age = (data1['age']>= 36) & (data1['age']<= 50)
older_adult = (data1['age']>= 51) & (data1['age']<= 60)
senior_citizen = (data1['age'] >60)

* & : AND
* | : OR
* == : equal to 
* != : not equal to 

Additional information: https://www.w3schools.com/python/python_operators.asp
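
When these operators are combined on pandas columns, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `<`, `>=` or `==`. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical users for illustration only.
df = pd.DataFrame({
    "age": [15, 22, 40, 70],
    "gender": ["male", "female", "female", "male"],
})

# AND: both conditions must hold for a row to be selected.
young_female = (df["age"] <= 25) & (df["gender"] == "female")

# OR: either condition is enough.
teen_or_senior = (df["age"] < 18) | (df["age"] > 60)

# ~ negates a boolean mask element by element.
neither = ~teen_or_senior

print(df.loc[young_female])
```

Each mask is a boolean Series with one entry per row, so it can be passed straight to `.loc` as in the next cell.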

In [None]:
data2 = pd.DataFrame(data1.loc[teenager])
data3 = pd.DataFrame(data1.loc[young_adult])
data4 = pd.DataFrame(data1.loc[adult])
data5 = pd.DataFrame(data1.loc[middle_age])
data6 = pd.DataFrame(data1.loc[older_adult])
data7 = pd.DataFrame(data1.loc[senior_citizen])

<font color=blue> <b> Practice 5 </b> </font>

In [None]:
# create group1: age not equal to 18
group1 = 
group1

# create a new dataframe for this group 
df1 = 
df1

In [None]:
# Alternatively
# to create a group that only contains female user
df2 = data1[data1.gender != 'male']
df2

**Adding new column 'age group'**


In [None]:
data2.insert(1, "age group", 'teenager')
data3.insert(1, "age group", 'young adult')
data4.insert(1, "age group", 'adult')
data5.insert(1, "age group", 'middle aged')
data6.insert(1, "age group", 'older adult')
data7.insert(1, "age group", 'senior citizen')

In [None]:
data2.head()

### Merging dataframes together


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames row-wise
data_Final = pd.concat([data2, data3, data4, data5, data6, data7])
data_Final
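
The filter, insert and combine steps above can also be done in one pass with `pd.cut`, which assigns each value to a labelled bin. This is an alternative sketch, not the method used in this notebook, and it assumes ages are whole numbers so that the bin edge 17 separates "below 18" from "18 and above". The sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the boundaries of each group.
df = pd.DataFrame({"age": [13, 18, 25, 26, 50, 51, 60, 61]})

# Bins are right-inclusive by default: (-inf, 17], (17, 25], (25, 35], ...
bins = [-np.inf, 17, 25, 35, 50, 60, np.inf]
labels = ["teenager", "young adult", "adult",
          "middle aged", "older adult", "senior citizen"]

df["age group"] = pd.cut(df["age"], bins=bins, labels=labels)
print(df)
```

Unlike the six separate frames, this keeps the rows in their original order and produces an ordered categorical column.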

**Merging by column**
* To demonstrate merging by column, we add `'dob_year'` from the original data

In [None]:
x = pd.DataFrame(data['dob_year'])
result = pd.concat([data_Final, x], axis=1)
result
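
Note that `pd.concat(..., axis=1)` lines rows up by index label, not by position. If one frame has lost rows (for example through `dropna`), the default outer join brings those index labels back with missing values. The sketch below shows this on two hypothetical frames.

```python
import pandas as pd

# Hypothetical frames: 'left' is missing index 1, as if that row had been dropped.
left = pd.DataFrame({"age": [14, 30]}, index=[0, 2])
right = pd.DataFrame({"dob_year": [1999, 1990, 1983]}, index=[0, 1, 2])

outer = pd.concat([left, right], axis=1)                # default: union of indexes
inner = pd.concat([left, right], axis=1, join="inner")  # only shared index labels

print(outer)
print(inner)
```

Passing `join="inner"` is one way to keep only the rows present in both frames.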

### Groupby age group and gender to find statistical summary

In [None]:
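# 'age group' holds text, so exclude it before computing numeric statistics
data_Final.drop(columns='age group').groupby(by='gender').std()

In [None]:
data_Final.drop(columns='age group').groupby(by='gender').median()

In [None]:
data_Final.drop(columns='age group').groupby(by='gender').mean()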

In [None]:
data_Final.groupby(by='gender').describe()
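
To group by both age group and gender, as the heading suggests, a list of column names can be passed to `groupby`. The sketch below uses a hypothetical five-row frame that borrows the notebook's column names.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "age group": ["teenager", "teenager", "adult", "adult", "adult"],
    "gender": ["male", "female", "male", "male", "female"],
    "tenure": [100, 200, 300, 500, 700],
    "friend_count": [10, 20, 30, 50, 70],
})

# One row per (age group, gender) pair; select the numeric columns explicitly.
summary = df.groupby(["age group", "gender"])[["tenure", "friend_count"]].mean()
print(summary)
```

The result has a two-level index, so a single group can be read with a tuple such as `summary.loc[("adult", "male")]`.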

# Data visualisation
**Python libraries for data visualisation: matplotlib.pyplot & seaborn**  
**1. Data visualisation for single variable**  
**2. Data visualisation for two or multiple variables**

## Data visualisation for single variable

**A. Countplot (For categorical variables)**

In [None]:
# Count plot using seaborn
f, axes = plt.subplots(1, 1, figsize=(6, 6))
count = sb.countplot(x="age group", data=data_Final )
count.set(xlabel='Age group', ylabel='Number of users')
count.set_xticklabels(count.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()

In [None]:
# Countplot using matplotlib
f, axes = plt.subplots(1, 1, figsize=(6, 6))
chart = data_Final['age group'].value_counts().plot(kind='bar')
plt.ylabel('Number of users')
plt.xlabel('Age group')
chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()

**B. Histogram (For numerical variables)**

In [None]:
# Histogram
f, axes = plt.subplots(1, 1, figsize=(6, 6))
plt.hist(data_Final['age'])
plt.ylabel('Number of Facebook users')
plt.xlabel('Age of Facebook user')
plt.title('Number of users versus age')
plt.show()

**C. Boxplot**

In [None]:
# Boxplot
f, axes = plt.subplots(1, 1, figsize=(6, 4))
# pass the variable as x= so it is not read as the positional `data` argument
sb.boxplot(x = data_Final['age'], orient = "h", color = 'r')

**D. Distribution plot**

In [None]:
# Distribution plot
f, axes = plt.subplots(1, 1, figsize=(6, 6))
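# distplot is deprecated since seaborn 0.11; from that version on,
# sb.histplot(data_Final['age'], kde=True, color='g') is the replacement
sb.distplot(data_Final['age'], color = 'g')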

**E. Violin plot**

In [None]:
# Violin plot
f, axes = plt.subplots(1, 1, figsize=(6, 6))
sb.violinplot(x = data_Final['age'], color = 'b')

<font color=blue> <b> Practice 6 </b> </font>   
Create a boxplot for the variable `tenure` by changing the size (adjust 'figsize' at canvas) and color.

Color code:
Black = 'k' ; 
Blue = 'b' ; 
Red = 'r' ; 
Green = 'g' ; 
Yellow = 'y' ; 
Cyan = 'c' ; 
Magenta = 'm'

## Data visualisation for two variables
**Between 2 numerical variables**  
**A. Scatterplot**

In [None]:
# Scatterplot
f, axes = plt.subplots(1, 1, figsize=(8, 8))
sb.scatterplot(data=data_Final, x='friendships_initiated', y='friend_count')

**B. Jointplot**

In [None]:
sb.jointplot(data=data_Final, x='friendships_initiated', y='friend_count', height=8)

**Between a categorical variable and a numerical variable**  
**C. Catplot**

In [None]:
sb.catplot(x = 'age group', y = 'tenure' , data = data_Final, kind = "violin", height = 8)

<font color=blue> <b> Practice 7 </b> </font>   
Create a catplot between `gender` and `tenure` by changing the attribute 'kind' to 'box' and see the difference.

### Data Visualisation (For multiple variables)

In [None]:
data_Final_multi = pd.DataFrame(data_Final[['age', 'tenure', 'friend_count', 'friendships_initiated', 'likes','likes_received']])
data_Final_multi.head()

**A. Heatmap**

In [None]:
f, axes = plt.subplots(1, 1, figsize=(20, 10))

sb.heatmap(data_Final_multi.corr(), vmin = -1, vmax = 1, linewidths = 1,
           annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "RdBu")
axes.set_ylim(data_Final_multi.shape[1], 0) 

This overview heatmap shows the correlation between all the numerical variables selected above. 
* Positive correlation means that when one variable increases, the other tends to increase as well
* Negative correlation means that when one variable increases, the other tends to decrease
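
The sign of the coefficient can be seen on a toy example. In this sketch the hypothetical column `up` rises with `x` and `down` falls with `x`, both perfectly linearly, so `.corr()` (Pearson by default) returns +1 and -1 up to floating-point rounding.

```python
import pandas as pd

# Hypothetical columns: 'up' rises with x, 'down' falls with x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],
    "down": [10, 8, 6, 4, 2],
})

corr = df.corr()
print(corr.round(2))
```

Real data rarely sits exactly on a line, so the values in the heatmap fall somewhere between these two extremes.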

**B. Pairplot**

In [None]:
#sb.pairplot(data = data_Final_multi)

## Additional Information

Note: To run the following cells, make sure your seaborn is at least version 0.11.0, which introduced `displot`.
You can run the command below to install that version.

In [None]:
pip install seaborn==0.11.0

### 1. Age vs Tenure

In [None]:
# Combine a distribution plot and a FacetGrid 
sb.displot(data_Final, x='tenure', col="age group", multiple="dodge")

In [None]:
sb.displot(data_Final, x='tenure', hue='age group', multiple='stack')

In [None]:
sb.displot(data_Final, x='tenure', hue='age group', element='poly')

In [None]:
sb.displot(data_Final, x='tenure', col="gender", multiple="dodge")

In [None]:
age_tenure=pd.DataFrame(data_Final[['age','tenure']])

The cell above extracts `age` and `tenure` from the data set.

In [None]:
sb.catplot(y = 'age', data = age_tenure, kind = "count", height = 20)

From the catplot, we can see that there are 2 distinct groups of users, aged 13-30 and 40-70.

In [None]:
age_tenure1 = age_tenure[age_tenure['age'] < 30 ]

In [None]:
f, axes = plt.subplots(1, 1, figsize=(16, 8))
sb.boxplot(x = 'age', y = 'tenure', data = age_tenure1)

In [None]:
# pandas 1.3+ takes a string here; inclusive=False was removed in pandas 2.0
age_tenure2 = age_tenure[age_tenure['age'].between(39, 71, inclusive='neither')]
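
The effect of the `inclusive` argument is easiest to see on the boundary values themselves. This sketch assumes pandas 1.3 or later, where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'` and `'right'`. The ages are hypothetical.

```python
import pandas as pd

# Hypothetical ages sitting on and just inside the bounds 39 and 71.
ages = pd.Series([39, 40, 70, 71])

both = ages.between(39, 71)                          # default: bounds included
neither = ages.between(39, 71, inclusive="neither")  # bounds excluded

print(both.tolist())
print(neither.tolist())
```

Excluding both bounds therefore selects ages 40 to 70.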

In [None]:
f, axes = plt.subplots(1, 1, figsize=(16, 8))
sb.boxplot(x = 'age', y = 'tenure', data = age_tenure2)

### 2. www_likes & mobile_likes vs total likes

In [None]:
likesData = pd.DataFrame(data[['mobile_likes', 'www_likes', 'mobile_likes_received', 'www_likes_received', 'likes']])
likesData.head()

In [None]:
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "mobile_likes", y = "likes", data = likesData)

In [None]:
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "www_likes", y = "likes", data = likesData)


The two scatter plots of mobile likes and www likes against total likes hint at which medium is used more. The mobile likes points spread over a wider range of total likes, so we can tentatively conclude that more users interact with the like button from their mobile devices.

(explanation for this is uncertain, open for discussion and changes)
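
One way to put numbers behind this reading, rather than relying on the scatter plots alone, is to compare the totals per medium. The sketch below runs on hypothetical numbers and assumes that `likes` is the sum of `mobile_likes` and `www_likes`; that relationship is not verified against the dataset here.

```python
import pandas as pd

# Hypothetical like counts for four users.
likes_demo = pd.DataFrame({
    "mobile_likes": [10, 0, 5, 25],
    "www_likes": [0, 4, 5, 1],
})
# Assumption: total likes is the sum of the two media.
likes_demo["likes"] = likes_demo["mobile_likes"] + likes_demo["www_likes"]

# Share of all likes that came from each medium.
totals = likes_demo[["mobile_likes", "www_likes"]].sum()
share = totals / likes_demo["likes"].sum()

# Fraction of users who liked more often on mobile than on the web.
mostly_mobile = (likes_demo["mobile_likes"] > likes_demo["www_likes"]).mean()

print(share)
print(mostly_mobile)
```

The two measures answer different questions: `share` is about the volume of likes, while `mostly_mobile` is about the number of users, which is closer to the claim made above.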

### 3. Interactive Visualisation

In [None]:
import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px

In [None]:
trace = go.Histogram(x = data_Final['age group'], histnorm = 'density')
layout = go.Layout(title = 'Number of Facebook users')
traces = [trace]  # avoid reusing the name `data`, which holds the original DataFrame
fig = go.Figure(data = traces, layout = layout)
py.iplot(fig)

# can compare with the non-interactive plot

In [None]:
fig = px.box(data_Final,x='gender',y='likes_received')
fig.show()