# Girls Who Code - Python Series
## Pandas
## Mentor - Amir ElTabakh

Pandas is a Python library used for data manipulation and analysis. Its name is a play on "Python Data Analysis", and it was published as an open source library in 2009 by Wes McKinney.

#### Agenda
- Installing Python packages on your machine
- Data Exploration
- Data Cleaning
- Data Visualizations

Pandas is not part of the Python standard library. It is a third-party package, one of the many that developers publish all the time. To install Pandas on our machine we will use pip, the standard package manager for Python, which lets you install and manage additional packages. The Python installer installs pip, so it should be ready for use. Verify that pip is installed by running the following command:

In [None]:
!pip --version

Now let's pip install pandas with the following command. Note when using a Notebook, such as this one on Jupyter, we can run shell commands by starting a line with an exclamation mark.

In [None]:
!pip install pandas

Now that we've installed Pandas, let's import the library. Note that we only have to install a library once per machine, but we have to import it in every program in which we wish to use it.

In [None]:
import pandas as pd
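To make the install-once, import-every-time distinction concrete, here is a minimal sketch that checks whether a package is installed without importing it. The helper name `is_installed` is hypothetical and is not part of Pandas or pip.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# "json" ships with Python, so this should always be found
print(is_installed("json"))

# This prints True only if pandas has been installed on this machine
print(is_installed("pandas"))
```

If the second check prints `False`, the `pip install` step needs to be run again.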

Pandas is one of the most widely used libraries for data analysis and data wrangling. Thankfully, there's a lot of documentation.

https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide


Spreadsheet data is commonly saved as either a `.csv` or an `.xlsx` file. CSV stands for Comma Separated Values; it is a plain text file that contains a list of data. XLSX files are used by Microsoft Excel, a spreadsheet application that uses tables to organize, analyze, and store data. Microsoft Excel encourages saving your file as an `.xlsx` file.

We will be importing CSV files in our workshop. The code to import a CSV file is different from the code to import an XLSX file. To import an XLSX file, run this code:

`variable_name = pd.read_excel("Resources/file_name.xlsx", sheet_name="optional")`

In [None]:
# Importing CSVs
school_data_df = pd.read_csv("Resources/schools_complete.csv")
student_data_df = pd.read_csv("Resources/students_complete.csv")
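To see what `pd.read_csv` does with comma-separated text, here is a small sketch that reads from an in-memory string instead of a file. The school names and budgets below are made up for illustration and are not taken from the workshop files.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """school_name,type,budget
Example High,District,1000
Sample Academy,Charter,2000
"""

# read_csv accepts a file path or any file-like object
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.shape)             # (rows, columns)
print(example_df.columns.tolist())  # column names come from the first line
```

The first line of the text becomes the column names, and each following line becomes one row.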

## Data Exploration

In [None]:
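# A negative number tells head() to skip rows from the end: head(-1) shows every row except the last one
school_data_df.head(-1)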

In [None]:
student_data_df.head()
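The `head` method behaves differently depending on the number passed to it. The sketch below uses a small made-up dataframe (`demo_df` is a hypothetical name) to compare the three cases.

```python
import pandas as pd

# A made-up dataframe with 10 rows
demo_df = pd.DataFrame({"student_id": range(10)})

first_five = demo_df.head()      # no argument: the first 5 rows
first_three = demo_df.head(3)    # positive n: the first n rows
all_but_last = demo_df.head(-1)  # negative n: everything except the last |n| rows

print(len(first_five), len(first_three), len(all_but_last))  # 5 3 9
```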

In [None]:
student_data_df[["reading_score", "math_score"]].describe()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe

This is the documentation for the describe method.
```
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
```
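As a quick illustration of "excluding NaN values", here is a sketch with made-up scores in which one reading score is missing. The `scores` dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up scores; one reading score is missing
scores = pd.DataFrame({
    "reading_score": [70, 80, 90, np.nan],
    "math_score": [60, 70, 80, 90],
})

summary = scores.describe()

# count reports only the non-missing values in each column
print(summary.loc[["count", "mean", "min", "max"]])
```

The `count` row shows 3 for `reading_score` and 4 for `math_score`, and the reading mean is computed from the three values that are present.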
We've gotten a high-level overview of the reading and math scores, but what is the average of the two scores?

In [None]:
reading_score_mean = student_data_df["reading_score"].mean()
math_score_mean = student_data_df["math_score"].mean()
total_mean = (reading_score_mean + math_score_mean) / 2

print(f"Reading Score Mean: {round(reading_score_mean, 2)}")
print(f"Math Score Mean: {round(math_score_mean, 2)}")
print(f"Average Score: {round(total_mean, 2)}")

## Data Cleaning
### Checking for missing data

Let's check each column of the dataframe for missing values.

In [None]:
# ".isnull()" marks every missing cell as True. Chaining the Pandas ".sum()" method
# after it counts the True values, giving the number of missing values in each column.
student_data_df.isnull().sum()

In [None]:
student_data_df.notnull()

Thankfully there are no null values in our `student_data_df` dataframe, so that saves us some time. There are multiple approaches to dealing with missing data. Let's look at how we would deal with missing data in a dataset such as `missing_grade_df`.

Suppose we have missing data points in the `reading_score` and `math_score` columns. If we do nothing, Pandas skips those NaNs when we sum or average the reading and math scores. However, if we multiply or divide using a value that is NaN, the answer will be NaN. This may cause problems.
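Here is a small sketch of that behavior, using made-up scores. The `demo_scores` dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value in each column
demo_scores = pd.DataFrame({
    "reading_score": [80.0, np.nan, 90.0],
    "math_score": [70.0, 60.0, np.nan],
})

# sum() and mean() skip NaN by default
reading_total = demo_scores["reading_score"].sum()   # 170.0
reading_mean = demo_scores["reading_score"].mean()   # 85.0

# Row-by-row arithmetic gives NaN wherever either value is NaN
product = demo_scores["reading_score"] * demo_scores["math_score"]

print(reading_total, reading_mean)
print(product)
```

Only the first row of `product` holds a number, because it is the only row where both scores are present.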

There are two simple approaches to dealing with the missing data.

- Drop the rows where there are NaNs. This can cause problems later if those rows hold data in other columns that we need. Before dropping rows with NaN, you should ask yourself how much data would be removed and how that would impact the analysis.
```
# Drop the NaNs
missing_grade_df.dropna()
```
- Fill in the missing values. Filling in missing values must be done with caution, because adding irrelevant data may distort arithmetic calculations.
```
# Fill in the empty rows with 85.
missing_grade_df.fillna(85)
```
There are many ways to deal with missing data; find one that works for your needs.
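The two approaches above can be sketched on a small made-up dataframe. The values in `missing_grade_df` below are invented for illustration, and filling with each column's mean is shown as one possible alternative to a fixed number like 85, not as the recommended choice.

```python
import numpy as np
import pandas as pd

# Made-up grades with two missing values
missing_grade_df = pd.DataFrame({
    "reading_score": [80.0, np.nan, 90.0, 70.0],
    "math_score": [75.0, 65.0, np.nan, 85.0],
})

# How many rows would we lose by dropping NaNs?
rows_with_nan = missing_grade_df.isnull().any(axis=1).sum()

# dropna() and fillna() return new dataframes; the original is unchanged
dropped_df = missing_grade_df.dropna()
filled_df = missing_grade_df.fillna(missing_grade_df.mean())

print(f"Rows with at least one NaN: {rows_with_nan}")
print(dropped_df)
print(filled_df)
```

Because both methods return a new dataframe, the result has to be assigned to a variable to be kept.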

### Cleaning Student Names

Some names have prefixes or suffixes. Row 4 has a student with the prefix 'Dr.'. Let's remove all of the prefixes and suffixes.

In [None]:
# Outputting the column vector `student_name`.
student_data_df["student_name"]

In [None]:
student_names = student_data_df["student_name"].tolist()
student_names[0:10]

In [None]:
# Filter this list using a conditional statement. If a name is made up of
# more than 2 words, we append it to a new list
students_to_fix = []

for name in student_names:
    if len(name.split()) > 2:
        students_to_fix.append(name)

print("There are " + str(len(students_to_fix)) + " invalid names out of a total " + str(len(student_names)) + ".")
students_to_fix[0:10]

In [None]:
# Add each first word that is 4 characters or shorter to a new list of possible prefixes
prefixes = []

for name in students_to_fix:
    if len(name.split()[0]) <= 4:
        prefixes.append(name.split()[0])
        
print(pd.Series(prefixes).unique())

In [None]:
prefixes = ['Dr.', 'Mr.', 'Mrs.', 'Miss', 'Ms.']

In [None]:
# Add each last word that is 3 characters or shorter to a new list of possible suffixes
suffixes = []

for name in students_to_fix:
    if len(name.split()[-1]) <= 3:
        suffixes.append(name.split()[-1])
        
print(pd.Series(suffixes).unique())

In [None]:
suffixes = ['MD', 'III', 'DVM', 'DDS', 'II', 'PhD', 'Jr.', 'IV']

In [None]:
# Add each prefix and suffix to remove to a list.
prefixes_suffixes = prefixes + suffixes

In [None]:
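# Iterate through the "prefixes_suffixes" list and replace each one with an empty string, "", when it appears in the student's name.
# regex=False makes Pandas treat the "." in "Dr." as a literal period instead of a wildcard.

for word in prefixes_suffixes:
    student_data_df["student_name"] = student_data_df["student_name"].str.replace(word, "", regex=False)

# Remove the leading and trailing spaces left behind by the replacements
student_data_df["student_name"] = student_data_df["student_name"].str.strip()
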
# Put the cleaned student's names in another list
student_names = student_data_df["student_name"].tolist()
student_names[:10]
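Replacing text anywhere in a name can also remove letters from the middle of a word. As an alternative sketch, the function below splits each name into words and drops only the words that exactly match a prefix or suffix. The `strip_titles` helper and the sample names are hypothetical; the set of titles is copied from the lists above.

```python
import pandas as pd

# The prefixes and suffixes identified above, stored as a set for fast lookup
titles = {"Dr.", "Mr.", "Mrs.", "Miss", "Ms.",
          "MD", "III", "DVM", "DDS", "II", "PhD", "Jr.", "IV"}

def strip_titles(full_name):
    """Keep only the words of a name that are not a known prefix or suffix."""
    kept_words = [word for word in full_name.split() if word not in titles]
    return " ".join(kept_words)

# Made-up names for illustration
names = pd.Series(["Dr. Jane Doe", "John Smith Jr.", "Mrs. Ana Lopez DDS", "Alex Lee"])
cleaned = names.apply(strip_titles)

print(cleaned.tolist())  # ['Jane Doe', 'John Smith', 'Ana Lopez', 'Alex Lee']
```

Joining the kept words with a single space also means there are no stray spaces to clean up afterwards.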

## Practice Reading Documentation

Use the [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html) to answer the following questions.

In [None]:
# Which school in `school_data_df` has the highest budget?
# ...

In [None]:
# What is the total sum of the budgets of all schools in `school_data_df`?
# ...

In [None]:
# Which school has the greatest average reading score?
# ...

# Which school has the lowest average math score?
# ...

In [None]:
# What is the proportion of Males to Females according to `student_data_df`?
# ...

In [None]:
# What are the mean scores per school?
# ...

In [None]:
# What are the mean scores per grade?
# ...

## Data Visualization


### Histogram
A histogram shows the distribution of a continuous variable. It groups the values into ranges called bins and shows how many values fall into each bin, which makes it useful for exploring a single variable in a univariate analysis.
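To see the counting that sits behind a histogram, here is a sketch that bins a handful of made-up scores with NumPy's `histogram` function, without drawing anything. The `scores` list is hypothetical.

```python
import numpy as np

# Made-up scores for illustration
scores = [61, 65, 72, 78, 79, 84, 88, 91, 95, 99]

# Split the range 60-100 into 4 equal-width bins and count the scores in each
counts, bin_edges = np.histogram(scores, bins=4, range=(60, 100))

print(bin_edges)  # the boundaries of the bins: 60, 70, 80, 90, 100
print(counts)     # how many scores fall in each bin: 2, 3, 2, 3
```

Each count becomes the height of one bar, so changing the number of bins changes how coarse or fine the picture looks.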

In [None]:
# Importing dependencies
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Histogram of Reading Scores in the District
plt.hist(student_data_df['reading_score'])
plt.title("Histogram of Reading Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show() # This displays the graph

In [None]:
# Histogram of Reading Scores in the District, this time with 30 bins
plt.hist(student_data_df['reading_score'], bins = 30)
plt.title("Histogram of Reading Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show() # This displays the graph

In [None]:
# Barplot of Budgets in the District
school_name_list = school_data_df['school_name'].tolist()
school_budget_list = school_data_df['budget'].tolist()

# Recent versions of seaborn require x and y to be passed as keyword arguments
sns.barplot(x=school_budget_list, y=school_name_list)
plt.title("Barplot of School Budgets")

In [None]:
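# .iloc[1] selects the second count by position (value_counts() sorts from most to least common)
school_data_df['type'].value_counts().iloc[1]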

In [None]:
# Count the 'District'/'Charter' schools for a Pie Chart
type_counts = school_data_df['type'].value_counts()

# Look up each count by its label; positions depend on which type is more common
charter_count = type_counts['Charter']
district_count = type_counts['District']
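Because `value_counts()` sorts its result from most to least common, the position of each category depends on the data. The sketch below uses a made-up Series (`types` is a hypothetical name) to show a lookup by label, along with `normalize=True`, which returns proportions like the percentages a pie chart displays.

```python
import pandas as pd

# Made-up school types for illustration
types = pd.Series(["District", "Charter", "District", "Charter", "Charter"])

type_counts = types.value_counts()

# Looking up by label gives the right count no matter how the result is ordered
charter = type_counts["Charter"]    # 3
district = type_counts["District"]  # 2

# normalize=True returns each category's share of the total
shares = types.value_counts(normalize=True)

print(charter, district)
print(shares)
```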

In [None]:
labels = ['Charter', 'District']
sizes = [charter_count, district_count]
colors = ('cyan', 'coral')
explode = (0.1, 0)

# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(x = sizes, labels = labels, colors = colors, explode = explode, autopct='%1.1f%%', shadow=True, startangle=140)
plt.title("Pie Chart of 'District'/'Charter' Distribution")

In [None]:
# Histogram of Math Scores in the District
plt.hist(student_data_df['math_score'], bins = 20)
plt.title("Histogram of Math Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show() # This displays the graph

In [None]:
# Count the 'M'/'F' students for a Pie Chart
gender_counts = student_data_df['gender'].value_counts()

# Look up each count by its label; positions depend on which gender is more common
M_count = gender_counts['M']
F_count = gender_counts['F']

In [None]:
# Pie Chart of 'M'/'F' distribution

labels = ['M', 'F']
sizes = [M_count, F_count]
colors = ('cyan', 'indigo')

# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(x = sizes, labels = labels, colors = colors, autopct='%1.1f%%')
plt.title("Pie Chart of M/F Distribution")