# Introduction

In Part II, we cleaned and merged the data that we got from Part I. We are now going to do some exploratory data analysis! 

Visualization helps us build intuition about our data and identify possible outliers or anomalies.

In this notebook, you will do the following:
1. Import pandas and data visualization libraries
2. Visualize the column data
    - univariate analysis
    - bivariate analysis

Useful readings on visualization: 
<a href = "https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed">Introduction to Data Visualization in Python</a> (run it in Incognito Mode if you face the paywall)

We highly recommend this reading if you're new to data visualization.

### Step 1: Import the following libraries
- pandas
- matplotlib.pyplot as plt
- seaborn as sns

In [None]:
# Step 1: Import the following libraries

### Step 2: Import the CSV from Part II Step 18
We will import the CSV that we got from Part II, i.e. the CSV containing the merged data from studentInfo, studentAssessment, and studentVle. 

In [None]:
# Step 2: Read the CSV from Part II

## Univariate analysis (UA)
In this section, we will perform univariate analysis. We'll examine each column with either a histogram or a barplot. 

For categorical values, we will first get a frequency count and then plot it as a barplot.

<strong>Hint: Google "pandas column barplot frequency"</strong>

### Step 3: Perform UA on 'code_module' with barplot
Let's start with plotting the frequency of 'code_module' with a barplot. This will tell us how many students are enrolled in the different modules, i.e. AAA, BBB, ... , GGG. 

There are two ways to do it: sorted by frequency (the default) or sorted by index. If you like a challenge, consider sorting the index so that you get a plot that is ordered alphabetically rather than by frequency.

![FirstBarplot.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/FirstBarplot.png)
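
If you get stuck, the sketch below shows one possible way to go from a frequency count to an alphabetically sorted barplot. It runs on a tiny made-up DataFrame (`df` and its values are assumptions for illustration, not the real OULAD data), so you will still need to adapt it to the DataFrame you loaded in Step 2.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the merged DataFrame from Part II (values are made up).
df = pd.DataFrame({"code_module": ["BBB", "AAA", "BBB", "CCC", "BBB", "CCC"]})

# Frequency count: value_counts() orders by count, largest first, by default.
counts = df["code_module"].value_counts()

# Sort by the index instead to get alphabetical order (AAA, BBB, CCC).
counts_sorted = counts.sort_index()

# Plot the sorted counts as a vertical barplot.
ax = counts_sorted.plot(kind="bar")
ax.set_xlabel("code_module")
ax.set_ylabel("Number of rows")
plt.show()
```

Plotting `counts` instead of `counts_sorted` gives the frequency-ordered version.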

In [None]:
# Step 3: Plot a barplot for 'code_module'

### Step 4: Perform UA on 'gender' with barplot
Let's take a look at the gender makeup in the OULAD dataset.

What can you say about the proportion of males (M) vs females (F)?

In [None]:
# Step 4: Plot a barplot for 'gender'

### Step 5: Perform UA on 'region' with a horizontal barplot
Next up, let's see where our students are coming from.

The names might be long, so consider a horizontal barplot.
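
As a rough sketch, a horizontal barplot can be produced by switching the plot kind to `"barh"`. The DataFrame below is a made-up stand-in (an assumption for illustration only), so swap in your own DataFrame.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the merged DataFrame (values are made up).
df = pd.DataFrame(
    {"region": ["Scotland", "East Anglian Region", "Scotland", "London Region"]}
)

# kind="barh" draws horizontal bars, which leaves room for long labels.
ax = df["region"].value_counts().plot(kind="barh")
ax.set_xlabel("Number of rows")
plt.show()
```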

In [None]:
# Step 5: Plot a barplot for 'region'

### Step 6: Perform UA on 'highest_qualification' with a horizontal barplot
How about the students' qualifications? The Open University program takes in students from all walks of life so it'd be interesting to see where all of these students come from, academically. 

In [None]:
# Step 6: Plot a barplot for 'highest_qualification'

### Step 7: Perform UA on 'imd_band' with a horizontal barplot
IMD, which stands for Index of Multiple Deprivation, is an index that measures the deprivation (a measure of poverty) of small areas within the UK.

The higher the value of the imd_band, the better the living conditions.

As in Step 3, try sorting the index of the value counts before plotting.

In [None]:
# Step 7: Plot a barplot for 'imd_band'

### Step 8: Perform UA on 'age_band' with a barplot
Plot a barplot to look at the distribution of age in the dataset. 

In [None]:
# Step 8: Plot a barplot for 'age_band'

Having '<' and '=' in a column name is not recommended, since models from certain libraries are unable to accept them. We will make a mental note to fix this in Part IV.

### Step 9: Perform UA on 'num_of_previous_attempts' with a value count
No need to visualize the column data, just tabulate the frequency of the values in this column.

In [None]:
# Step 9: Count the values in the num_of_previous_attempts column

### Step 10: Perform UA on 'studied_credits' with a histogram
Let's look at the distribution of the number of credits that students take in the dataset.
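
For numeric columns, a histogram can be drawn straight from the pandas Series. The sketch below uses made-up values and an arbitrary `bins=10` (both are assumptions), so experiment with the bin count on the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the merged DataFrame (values are made up).
df = pd.DataFrame({"studied_credits": [30, 60, 60, 90, 120, 60, 30, 240]})

# kind="hist" bins the numeric values and plots the count in each bin.
ax = df["studied_credits"].plot(kind="hist", bins=10)
ax.set_xlabel("studied_credits")
plt.show()
```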

In [None]:
# Step 10: Plot a histogram for 'studied_credits'

### Step 11: Perform UA on 'disability' with a barplot
How many students declared a disability? Let's find out with a barplot.

In [None]:
# Step 11: Plot a barplot for 'disability'

### Step 12: Perform UA on 'final_result' with a barplot
This is an important step since 'final_result' is our dependent variable in this dataset.

In [None]:
# Step 12: Plot a barplot for 'final_result'

Looks like there are four different kinds of outcomes for students in this program - we'll also keep this in mind later on in Part IV.

### Step 13: Perform UA on 'sum_click' with a histogram
Let's see how our students fare in terms of interacting with the VLE.

In [None]:
# Step 13: Plot a histogram for 'sum_click'

### Step 14: Perform UA on 'mean' with a histogram
What is the distribution of students' average scores in the dataset?

In [None]:
# Step 14: Plot a histogram for 'mean'

### Step 15: Perform UA on 'max' with a histogram

In [None]:
# Step 15: Plot a histogram for 'max'

### Step 16: Perform UA on 'min' with a histogram

In [None]:
# Step 16: Plot a histogram for 'min'

## Bivariate Analysis (BA)
Next up, bivariate analysis. In univariate analysis, we examined each column on its own.

In this section, we will take a look at the relationships that columns have with each other. Usually these relationships are between an independent variable, e.g. sum_click, and the dependent variable, i.e. final_result.


Boxplots will be used a lot here, so head back up to the recommended reading if you need a refresher. In addition, we highly recommend using seaborn to plot boxplots.

It's simply easier to plot boxplots with seaborn than with matplotlib.pyplot, but we won't stop you from practising.

### Step 17: Perform BA on sum_click vs final_result with boxplot
<blockquote>Is there an observable relationship between student activity and the final result?</blockquote>
Plot a boxplot to see if there's a pattern between sum_click and final results.

Consider using the showfliers parameter as well because you're bound to encounter outliers that may obscure the boxplot.
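
Here is a minimal sketch of a seaborn boxplot with outliers hidden. The DataFrame is a made-up stand-in (an assumption for illustration). `showfliers=False` is forwarded by seaborn to matplotlib's boxplot, which then skips drawing the outlier points.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the merged DataFrame (values are made up).
# The 5000 in the "Pass" group is a deliberate outlier.
df = pd.DataFrame(
    {
        "final_result": ["Pass"] * 6 + ["Fail"] * 6,
        "sum_click": [100, 120, 130, 150, 160, 5000, 10, 20, 30, 40, 45, 60],
    }
)

# One box per final_result category; outlier points are not drawn.
ax = sns.boxplot(data=df, x="final_result", y="sum_click", showfliers=False)
plt.show()
```

Try the same call with `showfliers=True` to see how a single extreme value can squash the boxes.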

In [None]:
# Step 17: Plot sum_click vs final_result with boxplot

What can you say about the relationship between final results and sum of student activity?

### Step 18: Perform BA on mean vs final_result with boxplot
<blockquote>Is there a correlation between the students' average results and the final result?</blockquote>
Let's find out with a boxplot.

In [None]:
# Step 18: Plot mean vs final_result with boxplot

Looks like there is - keep this in mind for Part V.

### Step 19: Perform BA on age_band vs final_result with countplot
<blockquote>Can we split the final results based on age, and see if age affects outcomes?</blockquote>
Plot using seaborn's countplot method, and consider the following parameters:

1. data - your full DataFrame
2. x - 'final_result'
3. hue - 'age_band'

Hue is useful if you want to compare two categorical columns together.
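
The three parameters above might be combined as in the sketch below. The DataFrame is a made-up stand-in (an assumption for illustration), so replace it with your own.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the merged DataFrame (values are made up).
df = pd.DataFrame(
    {
        "final_result": ["Pass", "Fail", "Pass", "Withdrawn", "Pass", "Fail"],
        "age_band": ["0-35", "0-35", "35-55", "0-35", "55<=", "35-55"],
    }
)

# x sets the categories on the axis; hue splits each category into
# one colored bar per age_band value.
ax = sns.countplot(data=df, x="final_result", hue="age_band")
plt.show()
```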

In [None]:
# Step 19: Plot age_band vs final_result with countplot

The answer to this is encouraging - hope you find out what it is! 

### Step 20: Perform BA on disability vs final_result with countplot
<blockquote>Is there any observable relationship between the final results of the student, and whether he/she has a disability?</blockquote>
It's an interesting question, so let's use a countplot to find out!

In [None]:
# Step 20: Plot disability vs final_result with countplot

The answer to this is encouraging - hope you find out what it is! 

### Step 21: Perform BA on imd_band vs final_result with countplot
<blockquote>Does the relative poverty of an area affect the outcome of a student's final result?</blockquote>
The hypothesis would be that it's a yes - the IMD band does affect a student's final result.

Plot a countplot to find out. You might have to increase the plot figure size to have a better look at the plot.
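
One common way to enlarge the plot is to create the figure with an explicit `figsize` before calling seaborn. The sketch below assumes a made-up DataFrame and an arbitrary size of 12 by 6 inches, so adjust both to taste.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the merged DataFrame (values are made up).
df = pd.DataFrame(
    {
        "final_result": ["Pass", "Fail", "Pass", "Withdrawn"],
        "imd_band": ["90-100%", "0-10%", "20-30%", "0-10%"],
    }
)

# Create a larger figure first; seaborn then draws on the current axes.
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=df, x="final_result", hue="imd_band")
plt.show()
```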

In [None]:
# Step 21: Plot imd_band vs final_result with a countplot

Try to interpret the colorful plot - the data does support our hypothesis :/

### Step 22: Perform BA on gender vs final_result with countplot
<blockquote>Can we identify students' final results based on their gender?</blockquote>
Let's see if we can, with a countplot!

In [None]:
# Step 22: Plot gender vs final_result with a countplot

### End of Part III
That was a lot of plotting, but we hope you persevered and made the suggested plots. 

Telling a data story through visualization is important, and it's good to practice. 

Next up, we will prepare our data in Part IV so that it can be used for machine learning modelling in Part V.