# EXPLORATORY DATA ANALYSIS IN PYTHON
## Exploring Student Data
Imagine that you work for a school district and have collected some data on local students and their parents. You’ve been tasked with answering some important questions:

- How are students performing in their math classes?
- What do students’ parents do for work?
- How often are students absent from school?

In this project, you’ll explore and summarize some student data in order to answer these questions.

Data citation:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez

## Tasks


### Initial exploration
##### 1. The provided dataframe (saved as students) includes the following variables/features:

- address: the location of the student’s home ('U' for urban and 'R' for rural)
- absences: the number of times the student was absent during the school year
- Mjob: the student’s mother’s job industry
- Fjob: the student’s father’s job industry
- math_grade: the student’s final grade in math, ranging from 0 to 20

Use the pandas .head() method to inspect the first few rows of data.


`Hint` <br>
`Print the first five rows using print(students.head()).`

##### 2. Use the pandas .describe() method to print out summary statistics for all five features in the dataset. Inspect the output. Do more students live in urban or rural locations?


`Hint` <br>
`Make sure to use include = 'all' to include all of the columns in the dataset.`
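As a rough illustration of how to read the output, the sketch below runs `.describe(include='all')` on a small made-up dataframe (the `toy` name and its values are assumptions, not the real `students` data). For a categorical column such as address, the `top` row shows the most frequent value and the `freq` row shows how many times it occurs.

```python
import pandas as pd

# Hypothetical toy data standing in for the real students dataframe
toy = pd.DataFrame({
    "address": ["U", "U", "R", "U"],
    "absences": [6, 4, 10, 2],
    "math_grade": [6, 10, 15, 11],
})

# include="all" summarizes categorical and numeric columns together
summary = toy.describe(include="all")
print(summary)

# For a categorical column, 'top' is the most frequent value
# and 'freq' is the number of rows that hold it
print(summary.loc["top", "address"], summary.loc["freq", "address"])
```

Numeric-only rows such as `mean` show NaN for categorical columns, and `unique`, `top`, and `freq` show NaN for numeric columns.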

### Summarize a typical student grade
##### 3. Let’s start by trying to summarize the math_grade column. Calculate and print the mean value of math_grade.


`Hint` <br>
`Use the pandas method .mean().`

##### 4. Next, calculate and print the median value of math_grade. Compare this value to the mean. Is it smaller? larger?


`Hint` <br>
`Use the pandas method .median().`

##### 5. Finally, calculate and print the mode of the math_grade column. What is the most common grade earned by students in this dataset? How different is this number from the mean and median?


`Hint` <br>
`Use the pandas method .mode().`

`Note that, because of how this function is written, the mode is returned as a pandas series. In order to convert it to a single value, we can extract the first value in the series (e.g., students.math_grade.mode()[0])`
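One reason `.mode()` returns a series is that several values can tie for most common. A minimal sketch with made-up grades (the `grades` values are hypothetical) shows what happens in that case:

```python
import pandas as pd

# Hypothetical grades in which 10 and 12 are tied for most common
grades = pd.Series([10, 10, 12, 12, 15])

# .mode() returns every tied value, sorted in ascending order
modes = grades.mode()
print(modes)

# Taking label 0 keeps only the smallest of the tied values
first_mode = modes[0]
print(first_mode)
```

When there is a tie, extracting only the first value hides the other modes, so it can be worth printing the full series first.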

### Summarize the spread of student grades
##### 6. Next, let’s summarize the spread of student grades. Calculate and print the range of the math_grade column.


`Hint` <br>
`Subtract the minimum student grade from the maximum student grade to get the range, using the pandas methods .max() and .min().`

##### 7. Calculate and print the standard deviation of the math_grade column. For roughly bell-shaped distributions, about two thirds of values fall within one standard deviation of the mean. What does this number tell you about how much math grades vary?


`Hint` <br>
`Use the pandas method .std() to calculate the standard deviation.`

`The standard deviation is about 4.6, while the average grade is about 10.4. This means that about two thirds of students are earning a grade between 5.8 (calculated as 10.4 - 4.6) and 15 (calculated as 10.4 + 4.6).`
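The two-thirds figure is a rule of thumb for roughly bell-shaped data. As a rough check, the sketch below uses simulated, normally distributed values (not the real math_grade column; the mean and standard deviation from the hint are borrowed only as illustrative parameters, and the seed and sample size are arbitrary) and measures the share of values within one standard deviation of the mean.

```python
import numpy as np
import pandas as pd

# Simulated, roughly bell-shaped values; NOT the real math_grade data
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=10.4, scale=4.6, size=10_000))

mean = values.mean()
std = values.std()

# Share of values that lie within one standard deviation of the mean
within_one_std = values.between(mean - std, mean + std).mean()
print(round(within_one_std, 2))
```

The same `.between(...).mean()` pattern can be applied to `students.math_grade` to see how closely the real grades follow the rule.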

##### 8. Finally, calculate the mean absolute deviation of the math_grade column. This is the mean absolute difference between each student’s score and the average score.


`Hint` <br>
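`Use the pandas method .mad(). Note that .mad() was removed in pandas 2.0; on newer versions, calculate (students.math_grade - students.math_grade.mean()).abs().mean() instead.`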
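`Series.mad()` was deprecated in pandas 1.5 and removed in pandas 2.0, so it may not be available in your environment. The mean absolute deviation can also be calculated by hand; here is a minimal sketch on made-up grades (the `grades` values are hypothetical):

```python
import pandas as pd

# Hypothetical grades; their mean is 5.0
grades = pd.Series([2, 4, 4, 6, 9])

# Mean absolute deviation: the average distance of each value from the mean
mad = (grades - grades.mean()).abs().mean()
print(mad)  # 2.0
```

The absolute distances from the mean are 3, 1, 1, 1, and 4, which average to 2.0.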

### Visualize the distribution of student grades
##### 9. Now that we’ve summarized student grades using statistics for central tendency and spread, let’s visualize the distribution using a histogram. Use the seaborn histplot() function to create a histogram of math_grade.

Note that we’ve provided code to show and clear each plot using:

In [None]:
plt.show()
plt.clf()

This ensures that the plots don’t get layered on top of each other. Make sure that you add your code to call sns.histplot() above plt.show().


`Hint` <br>
`The syntax is:`

In [None]:
sns.histplot(x = 'column_name', data = data_name)

##### 10. Another way to visualize the distribution of a quantitative variable is using a box plot. Use the seaborn boxplot() function to create a boxplot of math_grade.

Make sure to add this code after the first call to plt.clf() from the above plot and before the second call to plt.show().


`Hint` <br>
`The syntax is:`

In [None]:
sns.boxplot(x = 'column_name', data = data_name)

### Summarize mothers' jobs
##### 11. The Mjob column in the dataset contains information about what the students’ mothers do as a profession. Summarize the Mjob column by printing the number of students who have mothers with each job type.

Which value of Mjob is most common?


`Hint` <br>
`Use the pandas .value_counts() method.`

##### 12. Now, calculate and print the proportion of students who have mothers with each job type. What proportion of students have mothers who work in health?


`Hint` <br>
`Use .value_counts(normalize = True) to calculate the proportion of values in each category.`
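To see how counts and proportions relate, here is a small sketch on a made-up series (the `jobs` values are hypothetical and are not taken from the Mjob column):

```python
import pandas as pd

# Hypothetical job categories for four students
jobs = pd.Series(["teacher", "health", "other", "other"])

# Raw counts per category, most frequent first
counts = jobs.value_counts()
print(counts)

# normalize=True divides each count by the total number of rows
proportions = jobs.value_counts(normalize=True)
print(proportions["health"])  # 0.25
```

Because each proportion is a count divided by the total, the proportions sum to 1.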

### Visualize the distribution of mothers' jobs

##### 13. Now that we’ve used summary statistics to understand the relative frequencies of different mothers’ jobs, let’s visualize the same information with a bar chart. Use the seaborn countplot() function to create a bar chart of the Mjob variable.


`Hint` <br>
`The syntax is:`

In [None]:
sns.countplot(x = 'column_name', data = data_name)

##### 14. We can also visualize the same information using a pie chart. Create a pie chart of the Mjob column.


`Hint` <br>
`The syntax is:`

In [None]:
df.column_name.value_counts().plot.pie()

### Further exploration
##### 15. Congratulations! You’ve begun to explore a dataset by calculating summary statistics and creating some basic data visualizations. There are still a few more columns in this dataset that we haven’t looked at carefully:

- address: the location of the student’s home ('U' for urban and 'R' for rural)
- absences: the number of times the student was absent during the school year
- Fjob: the student’s father’s job industry


Now that we’ve walked you through an exploration of math_grade and Mjob in more detail, take some time to explore the rest of the columns in the dataset! Which kinds of summary statistics and visualizations can you use to summarize these columns?


`Hint` <br>
`The address and Fjob columns are categorical (just like the Mjob column was). Try using .value_counts() or creating a bar chart or pie chart.`

`The absences column is quantitative. Try calculating central tendency and spread statistics and visualizing the distribution of absences using a histogram or box plot.`

In [None]:
# Load libraries
import pandas as pd
import numpy as np
import codecademylib3  # Codecademy-only helper; remove this line when running elsewhere
import matplotlib.pyplot as plt
import seaborn as sns

# Import data
students = pd.read_csv('students.csv')

# Print first few rows of data
print(students.head())

# Print summary statistics for all columns
print(students.describe(include='all'))

# Calculate mean
math_grade_mean = students.math_grade.mean()
print('Mean math grade: ', math_grade_mean)

# Calculate median
math_grade_median = students.math_grade.median()
print('Median math grade: ', math_grade_median)

# Calculate mode
math_grade_mode = students.math_grade.mode()[0]
print('Mode math grade: ', math_grade_mode)

# Calculate range
math_grade_range = students.math_grade.max() - students.math_grade.min()
print('Range math grade: ', math_grade_range)

# Calculate standard deviation
math_grade_std = students.math_grade.std()
print('Standard Deviation math grade: ', math_grade_std)

# Calculate MAD by hand (Series.mad() was removed in pandas 2.0)
math_grade_MAD = (students.math_grade - students.math_grade.mean()).abs().mean()
print('Mean absolute deviation math grade: ', math_grade_MAD)

# Create a histogram of math grades
sns.histplot(x = 'math_grade', data = students)


plt.show()
plt.clf()

# Create a box plot of math grades
sns.boxplot(x = 'math_grade', data = students)


plt.show()
plt.clf()

#
#
#

print('Mjob data visualizations:')
# Calculate number of students with mothers in each job category
print(students.Mjob.value_counts())

# Calculate proportion of students with mothers in each job category
print(students.Mjob.value_counts(normalize=True))

# Create bar chart of Mjob
sns.countplot(x = 'Mjob', data = students)


plt.show()
plt.clf()

# Create pie chart of Mjob
students.Mjob.value_counts().plot.pie()
plt.show()
plt.clf()

#
#
#

print('Address data visualizations:')
# Calculate number of students with addresses in each category
print(students.address.value_counts())

# Calculate proportion of students with addresses in each category
print(students.address.value_counts(normalize=True))

# Create bar chart of addresses
sns.countplot(x = 'address', data = students)


plt.show()
plt.clf()

# Create pie chart of addresses
students.address.value_counts().plot.pie()
plt.show()
plt.clf()

#
#
#

print('Absences data visualizations:')
# Calculate central tendency for Absences
# Calculate mean
absences_mean = students.absences.mean()
print('Mean absences: ', absences_mean)

# Calculate median
absences_median = students.absences.median()
print('Median absences: ', absences_median)

# Calculate mode
absences_mode = students.absences.mode()[0]
print('Mode absences: ', absences_mode)

# Calculate spread statistics for absences, starting with the range
absences_range = students.absences.max() - students.absences.min()
print('Range absences: ', absences_range)

# Calculate standard deviation
absences_std = students.absences.std()
print('Standard Deviation absences: ', absences_std)

# Calculate MAD by hand (Series.mad() was removed in pandas 2.0)
absences_MAD = (students.absences - students.absences.mean()).abs().mean()
print('Mean absolute deviation absences: ', absences_MAD)

# Create a histogram of absences
sns.histplot(x = 'absences', data = students)


plt.show()
plt.clf()

# Create a box plot of absences
sns.boxplot(x = 'absences', data = students)


plt.show()
plt.clf()

#
#
#

print('Fjob data visualizations:')
# Calculate number of students with fathers in each job category
print(students.Fjob.value_counts())

# Calculate proportion of students with fathers in each job category
print(students.Fjob.value_counts(normalize=True))

# Create bar chart of Fjob
sns.countplot(x = 'Fjob', data = students)
plt.show()
plt.clf()

# Create pie chart of Fjob
students.Fjob.value_counts().plot.pie()
plt.show()
plt.clf()