# Project 5 - Exploratory Data Analysis and Visualization

Exploratory data analysis (EDA) is an approach for summarizing and visualizing the important characteristics of a data set. It helps us understand the data's underlying structure and variables before feature engineering, formal modeling, model tuning, and other data analysis techniques. In this mini project, you will be introduced to some ways to explore data efficiently with different packages so that you can develop intuition about your data set:
* Import and briefly check data with Pandas, the Python data manipulation library.
* Get a basic description of the data: descriptive statistics, rows, and columns.
* Time series analysis
* Simple predictive modeling
* Discover patterns in data by visualizing it with Python data visualization packages such as Matplotlib and Seaborn, or by using functions to compute the correlation between features.

### Dataset
* The data set we are going to be using is from a language learning application on smartphones. It contains user info, lexeme info and session info. We have already joined them into one single table for our analysis purpose.

### General Philosophy and Steps for this project
 - Data preparation: load all needed dependencies and packages, set up the plot style
 - Data import: load data into a pandas dataframe and check it
 - Pose questions: propose hypotheses based on your intuition
 - Visualization: test your intuition and hypotheses using Python visualization packages, such as Matplotlib.

## Duolingo Exploratory Data Analysis

In [1]:
# Import all dependencies we need 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import datetime as dt
import seaborn as sns
%matplotlib inline 

In [2]:
# available plot styles
print(plt.style.available)

['Solarize_Light2', '_classic_test_patch', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']


### Import data

In [3]:
# The dataset is 1.3 GB, so we only look at the first 1,000,000 rows
df = pd.read_csv('learning_traces.13.csv', nrows=1000000)


In [4]:
# Print out the head of our dataset
df.head()

| | p_recall | timestamp | delta | user_id | learning_language | ui_language | lexeme_id | lexeme_string | history_seen | history_correct | session_seen | session_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1362076081 | 27649635 | u:FO | de | en | 76390c1350a8dac31186187e2fe1e178 | `lernt/lernen<vblex><pri><p3><sg>` | 6 | 4 | 2 | 2 |
| 1 | 0.5 | 1362076081 | 27649635 | u:FO | de | en | 7dfd7086f3671685e2cf1c1da72796d7 | `die/die<det><def><f><sg><nom>` | 4 | 4 | 2 | 1 |
| 2 | 1.0 | 1362076081 | 27649635 | u:FO | de | en | 35a54c25a2cda8127343f6a82e6f6b7d | `mann/mann<n><m><sg><nom>` | 5 | 4 | 1 | 1 |
| 3 | 0.5 | 1362076081 | 27649635 | u:FO | de | en | 0cf63ffe3dda158bc3dbd55682b355ae | `frau/frau<n><f><sg><nom>` | 6 | 5 | 2 | 1 |
| 4 | 1.0 | 1362076081 | 27649635 | u:FO | de | en | 84920990d78044db53c1b012f5bf9ab5 | `das/das<det><def><nt><sg><nom>` | 4 | 4 | 1 | 1 |


In [5]:
# Check the information of our data, such as columns and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
p_recall             1000000 non-null float64
timestamp            1000000 non-null int64
delta                1000000 non-null int64
user_id              1000000 non-null object
learning_language    1000000 non-null object
ui_language          1000000 non-null object
lexeme_id            1000000 non-null object
lexeme_string        1000000 non-null object
history_seen         1000000 non-null int64
history_correct      1000000 non-null int64
session_seen         1000000 non-null int64
session_correct      1000000 non-null int64
dtypes: float64(1), int64(6), object(5)
memory usage: 91.6+ MB


In [6]:
# Dimension of dataset #
df.shape

(1000000, 12)

#### Please try to answer the following questions as you follow the data visualization procedures below:

 - How many users are there from every country?
 - How many different languages are being studied?
 - Are there differences between users from different countries?
 - Time series analysis
     - temporal behaviour of users
     - when do people study?
     - how often do they study?
     - timestamp
 - Predictive modelling
     - correlations of p_recall with various features

### Numerical data

In [7]:
# List the distinct data types in the dataframe (numerical and non-numerical)
list(set(df.dtypes.tolist()))

[dtype('float64'), dtype('int64'), dtype('O')]

#### 1. Create a dataframe that only takes numerical data and show the head

In [None]:
# hint: df.select_dtypes() lets you select columns by data type, e.g. the numerical ones
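
As a rough illustration of the hint, the sketch below applies `select_dtypes()` to a tiny toy frame. The column names are borrowed from the dataset, but the values are made up and the variable names `toy` and `toy_num` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few column names from the dataset; the values are made up.
toy = pd.DataFrame({
    "p_recall": [1.0, 0.5, 1.0],
    "delta": [27649635, 27649635, 600],
    "user_id": ["u:FO", "u:FO", "u:xy"],
    "ui_language": ["en", "en", "es"],
})

# Keep only the numerical columns (int64 and float64 here);
# the object (string) columns are dropped.
toy_num = toy.select_dtypes(include=[np.number])
print(toy_num.head())
```

The same call on the full dataframe should keep the `float64` and `int64` columns and drop the `object` ones.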


#### 2. Draw histograms to show the distributions of all the numerical data from the dataframe we just created.

### Feature to feature relationship

Plotting all the numerical features in a pairplot would take too much time and be hard to interpret. Instead, we can check whether some variables are correlated with each other and then explain their relationships with common sense.

#### 3. Compute pairwise correlation matrix of numerical columns and draw a heatmap using seaborn plot

hint: the heatmap may look like this
![image.png](attachment:image.png)

In [None]:
# hint: corr(), sns.heatmap()
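
A minimal sketch of the two calls named in the hint, run on a toy frame rather than the real data. The values are invented, and the styling arguments (`annot`, `cmap`, `vmin`, `vmax`) are just one reasonable choice, not a required one.

```python
import pandas as pd
import seaborn as sns

# Toy numerical frame; column names come from the dataset, values are made up.
toy = pd.DataFrame({
    "history_seen": [6, 4, 5, 6, 4, 10],
    "history_correct": [4, 4, 4, 5, 4, 9],
    "session_seen": [2, 2, 1, 2, 1, 3],
    "session_correct": [2, 1, 1, 1, 1, 3],
})

# Pairwise Pearson correlation (the default method of DataFrame.corr).
corr = toy.corr()

# Heatmap of the correlation matrix, annotated with the coefficients.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Pairwise correlation (toy data)")
```

Fixing `vmin` and `vmax` to -1 and 1 keeps the color scale comparable between plots.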


### Categorical data - Countries

#### 1. What are the user interface languages?

#### 2. Do a value_counts() to see how popular each interface language is

#### 3. Draw a pie plot to visualize the user interface language distribution, with percentages shown on it

In [9]:
# hint: to show percentages, add the autopct parameter
# Type your answer below
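
The following sketch shows how `value_counts()` and the `autopct` parameter fit together, using a made-up series of interface language codes instead of the real `ui_language` column.

```python
import pandas as pd

# Made-up interface language codes standing in for df["ui_language"].
ui = pd.Series(["en", "en", "en", "es", "es", "pt", "it", "en"], name="ui_language")

# Count how often each language appears (sorted from most to least common).
counts = ui.value_counts()

# autopct formats the share of each wedge; "%1.1f%%" prints one decimal place.
ax = counts.plot.pie(autopct="%1.1f%%", figsize=(5, 5))
ax.set_ylabel("")  # drop the default y label, which clutters a pie plot
```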


#### 4. Draw a pie plot of the languages that individuals are learning

In [None]:
# Type your answer below


#### 5. Draw a pie plot to visualize which languages are being learned by people whose interface language is English

In [None]:
# show the learning languages of people whose interface language is English
# Type your answer below


#### 6. Let's see which languages are being learned by people whose interface languages are English, Spanish, Italian, and Portuguese by drawing four pie plots as subplots

In [None]:
# Type your answer below
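
One possible layout for four pie plots is a 2x2 grid from `plt.subplots`. The sketch below uses invented rows, and it assumes the interface languages are coded as `en`, `es`, `it`, and `pt`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows; the language codes are assumed to match those in the dataset.
toy = pd.DataFrame({
    "ui_language": ["en", "en", "en", "es", "es", "it", "pt", "pt"],
    "learning_language": ["de", "es", "fr", "en", "en", "en", "en", "es"],
})

ui_langs = ["en", "es", "it", "pt"]
fig, axes = plt.subplots(2, 2, figsize=(8, 8))

# One pie per interface language, filled from the filtered value counts.
for ax, lang in zip(axes.ravel(), ui_langs):
    counts = toy.loc[toy["ui_language"] == lang, "learning_language"].value_counts()
    ax.pie(counts.to_numpy(), labels=list(counts.index), autopct="%1.1f%%")
    ax.set_title(f"ui_language = {lang}")
```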


#### hint: the result plot would look like this
![image.png](attachment:image.png)

## Users Activity patterns

#### 1. Let's look at users' activity in sessions by doing a value_counts() and computing summary statistics

In [10]:
# hint: for summary statistics, use the describe() function
# Type your answer below


#### 2. Compute the duration of the dataset

In [None]:
# hint: maximum timestamp minus minimum timestamp
# Type your answer below
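
The timestamps in the `df.head()` output look like UNIX times in seconds. Under that assumption, the sketch below computes a duration on made-up timestamps and converts it to days. The choice of days as the unit is also an assumption; the exercise does not state one.

```python
import pandas as pd

# Made-up UNIX timestamps (seconds), starting at the value seen in df.head().
toy = pd.DataFrame({"timestamp": [1362076081, 1362082032, 1362160000, 1362248881]})

SECONDS_PER_DAY = 24 * 60 * 60

# Duration = latest timestamp minus earliest timestamp.
duration_seconds = toy["timestamp"].max() - toy["timestamp"].min()
duration_days = duration_seconds / SECONDS_PER_DAY  # unit choice is an assumption

print(duration_seconds, duration_days)
```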


## Boxplots

#### 3. Compare activity levels for people with the 4 different user interface languages by drawing boxplots

In [None]:
# hint: df[df.ui_language == 'en']['user_id'].value_counts()/duration
# Type your answer below
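
The hint can be read as "rows per user, divided by the dataset duration". The sketch below follows that reading on toy data with two languages; the `duration` value is hypothetical and stands in for the result of the previous exercise.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows: each row is one record for a user (values are made up).
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "c", "c", "d", "d", "d", "d"],
    "ui_language": ["en", "en", "en", "en", "es", "es", "es", "es", "es", "es"],
})
duration = 2.0  # hypothetical duration, e.g. in days

# Activity per user = number of rows for that user / duration, per language.
activity = {
    lang: toy.loc[toy["ui_language"] == lang, "user_id"].value_counts() / duration
    for lang in ["en", "es"]
}

# One box per interface language.
fig, ax = plt.subplots()
ax.boxplot([series.to_numpy() for series in activity.values()])
ax.set_xticks(range(1, len(activity) + 1))
ax.set_xticklabels(list(activity))
ax.set_ylabel("rows per user per unit of duration")
```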


#### 4. Do the same thing but set ylim to 0-10. Which plot can be read more clearly?

In [None]:
# Type your answer below


#### 5. Another way to zoom in, instead of setting ylim, is to use a log scale on the boxplot

In [None]:
# hint: log scale on y
# Type your answer below
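
To compare the two zooming options side by side, the sketch below draws the same boxplots twice on synthetic heavy-tailed samples: once with `set_ylim(0, 10)` and once with `set_yscale("log")`. The samples and group names are invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, heavy-tailed "activity" samples for two hypothetical groups.
groups = {
    "en": rng.lognormal(mean=0.0, sigma=1.5, size=300),
    "es": rng.lognormal(mean=0.5, sigma=1.5, size=300),
}

fig, (ax_lim, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
for ax in (ax_lim, ax_log):
    ax.boxplot(list(groups.values()))
    ax.set_xticks(range(1, len(groups) + 1))
    ax.set_xticklabels(list(groups))

# Option 1: clip the view; outliers above 10 are simply not shown.
ax_lim.set_ylim(0, 10)
ax_lim.set_title("ylim = (0, 10)")

# Option 2: compress the axis; all points stay visible.
ax_log.set_yscale("log")
ax_log.set_title("log-scaled y axis")
```

Note that a log axis cannot display zero or negative values, so it only suits strictly positive activity levels.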


## Regression and predictive tasks

In [11]:
df.head()

| | p_recall | timestamp | delta | user_id | learning_language | ui_language | lexeme_id | lexeme_string | history_seen | history_correct | session_seen | session_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1362076081 | 27649635 | u:FO | de | en | 76390c1350a8dac31186187e2fe1e178 | `lernt/lernen<vblex><pri><p3><sg>` | 6 | 4 | 2 | 2 |
| 1 | 0.5 | 1362076081 | 27649635 | u:FO | de | en | 7dfd7086f3671685e2cf1c1da72796d7 | `die/die<det><def><f><sg><nom>` | 4 | 4 | 2 | 1 |
| 2 | 1.0 | 1362076081 | 27649635 | u:FO | de | en | 35a54c25a2cda8127343f6a82e6f6b7d | `mann/mann<n><m><sg><nom>` | 5 | 4 | 1 | 1 |
| 3 | 0.5 | 1362076081 | 27649635 | u:FO | de | en | 0cf63ffe3dda158bc3dbd55682b355ae | `frau/frau<n><f><sg><nom>` | 6 | 5 | 2 | 1 |
| 4 | 1.0 | 1362076081 | 27649635 | u:FO | de | en | 84920990d78044db53c1b012f5bf9ab5 | `das/das<det><def><nt><sg><nom>` | 4 | 4 | 1 | 1 |


#### 1. Scatter plot the relation between session_seen and p_recall

In [None]:
# Type your answer below


#### 2. Scatter plot the same distribution with log scale. 

In [None]:
# Type your answer below


#### 3. Group by session_seen and take the average; use session_seen as x and scatter plot its relation with the average p_recall as y

In [None]:
# hint: use groupby(), and mean() functions
# Type your answer below
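
One reading of this exercise is: group the rows by `session_seen`, average `p_recall` within each group, and plot one point per group. The sketch below follows that reading on made-up values.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows using two column names from the dataset.
toy = pd.DataFrame({
    "session_seen": [1, 1, 2, 2, 3],
    "p_recall": [1.0, 0.0, 0.5, 1.0, 1.0],
})

# Average p_recall for each distinct value of session_seen.
mean_recall = toy.groupby("session_seen")["p_recall"].mean()

# x = group key (session_seen), y = group mean of p_recall.
fig, ax = plt.subplots()
ax.scatter(mean_recall.index, mean_recall.to_numpy())
ax.set_xlabel("session_seen")
ax.set_ylabel("mean p_recall")
```

Averaging first reduces overplotting: the raw scatter has many points stacked on a few p_recall values, while the grouped version has one point per x value.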


#### 4. Group by history_seen and take the average; use history_seen as x and scatter plot its relation with the average p_recall as y

In [None]:
# Type your answer below


#### 5. Scatter plot the relation between history_seen and p_recall when session_seen is greater than 5

In [None]:
# Type your answer below


#### 6. Hexbin plot the relation between history_seen and p_recall with log scale

In [None]:
# hint: hexbin()
# Type your answer below
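
A minimal sketch of `hexbin()` with log scaling, on synthetic data. It assumes that "log scale" means a log-scaled x axis plus log-scaled bin counts; the y axis is left linear here because a proportion such as p_recall may contain zeros, which a log axis cannot show.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: strictly positive counts and proportions in [0, 1).
history_seen = rng.integers(1, 500, size=2000)
p_recall = rng.uniform(0.0, 1.0, size=2000)

fig, ax = plt.subplots()
# xscale="log" bins x on a log axis; bins="log" colors hexagons by log count.
hb = ax.hexbin(history_seen, p_recall, gridsize=30, xscale="log", bins="log")
fig.colorbar(hb, ax=ax, label="count (log scale)")
ax.set_xlabel("history_seen")
ax.set_ylabel("p_recall")
```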


#### 7. Create a dataframe with user_id and user_activity, indexed by user_id

In [None]:
# create an empty dataframe and set index to user_id
# Type your answer below


#### 8. Merge the dataframe from step 7 onto the original dataframe

In [None]:
# hint: merge(); pay attention to how to merge (inner, outer, ...)
# Type your answer below
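
The sketch below shows one way steps 7 and 8 could fit together on toy data. It assumes `user_activity` is simply the number of rows per user (the exercise may intend a rate instead), and it uses a left merge so that every original row is kept.

```python
import pandas as pd

# Made-up rows; each row is one record for a user.
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "c", "c"],
    "p_recall": [1.0, 0.5, 1.0, 0.0, 1.0, 0.5],
})

# Step 7 (assumed definition): rows per user, indexed by user_id.
user_activity = toy.groupby("user_id").size().rename("user_activity").to_frame()

# Step 8: attach each user's activity to every one of their rows.
# how="left" keeps all rows of the original frame, in their original order.
merged = toy.merge(user_activity, how="left", left_on="user_id", right_index=True)
print(merged)
```

With `how="inner"` the result would be the same here, because every `user_id` in `toy` also appears in `user_activity`; the choice matters once one side has keys the other lacks.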


#### 9. Get summary statistics of p_recall when user_activity is less than 8 and when it is greater than 59, respectively

In [None]:
# Type your answer below


#### 10. Hexbin plot the relation between delta and p_recall with log scale

In [None]:
# Type your answer below


#### hint: the hexbin plot would look like this
![image.png](attachment:image.png)

#### 11. Scatter plot relation between user_activity and delta

In [None]:
# Type your answer below


#### 12. Hexbin plot the relation between user_activity and delta with log scale

In [None]:
# Type your answer below


## Temporal patterns and Time Series

#### 1. Plot the overall activity pattern of all users to see at what times of day people are most active

In [None]:
# Type your answer below


In [None]:
# You can use the datetime library to convert a timestamp to an actual datetime
dt.datetime.fromtimestamp(df.timestamp.min())
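
Note that `dt.datetime.fromtimestamp()` returns a time in the machine's local time zone. As an alternative sketch, pandas can convert a whole column at once; the example below assumes the timestamps are UNIX seconds and interprets them as UTC, using made-up values.

```python
import pandas as pd

# Made-up UNIX timestamps in seconds; the first one appears in df.head().
toy = pd.DataFrame({"timestamp": [1362076081, 1362076141, 1362079681]})

# Vectorized conversion; utc=True makes the time zone explicit.
toy["datetime"] = pd.to_datetime(toy["timestamp"], unit="s", utc=True)

# Hour of day (0-23) for each row, then the number of rows per hour.
hours = toy["datetime"].dt.hour
per_hour = hours.value_counts().sort_index()
print(per_hour)
```

Plotting `per_hour` as a bar or line chart gives a first view of when users are active, keeping in mind that the hours are in UTC rather than each user's local time.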

#### 2. Visualize the daily activity pattern of each interface language's users on a single plot.

In [12]:
# hint: use a rolling window here
# Type your answer below
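
One way to use a rolling window here is to count rows per hour of day for each interface language and then smooth the counts. The sketch below does this on synthetic data; the window size of 3 hours, the UTC interpretation of the timestamps, and the two language codes are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic rows spread over one week, with two hypothetical interface languages.
toy = pd.DataFrame({
    "timestamp": 1362076081 + rng.integers(0, 7 * 86400, size=n_rows),
    "ui_language": rng.choice(["en", "es"], size=n_rows),
})

# Hour of day (0-23), interpreting the timestamps as UTC seconds.
toy["hour"] = pd.to_datetime(toy["timestamp"], unit="s", utc=True).dt.hour

# Rows per hour and language: one row per hour, one column per language.
counts = toy.groupby(["hour", "ui_language"]).size().unstack(fill_value=0)
counts = counts.reindex(range(24), fill_value=0)

# Rolling mean over 3 consecutive hours; min_periods=1 avoids NaN at the start.
smoothed = counts.rolling(window=3, min_periods=1).mean()

# One line per interface language on a single plot.
ax = smoothed.plot()
ax.set_xlabel("hour of day (UTC)")
ax.set_ylabel("rows per hour (rolling mean)")
```

A larger window gives smoother curves but blurs short peaks, so it is worth trying a few sizes.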


#### hint: the result plot would look like this
![image.png](attachment:image.png)