This notebook is designed to capture notes on Python code for quick reference!

# What is EDA
Exploratory Data Analysis (EDA for short) is all about getting curious about your data – finding out what is there, what patterns you can find, and what relationships exist. EDA is the important first step towards analysis and model building. When done well, it can help you formulate further questions and areas for investigation, and it almost always helps uncover aspects of your data that you wouldn’t have seen otherwise.

## Goals of EDA
Depending on what you want to do with your data, EDA can take many different forms; however, the main goals of EDA are generally:

- Uncover the data structure and determine how it is coded
- Inspect and “get to know” the data by summarizing and visualizing it
- Detect outliers, missing data, and other anomalies and decide how/whether to address these issues
- Find new avenues for analysis and further research
- Prepare for model building or analysis, including the following:
  1. Check assumptions
  2. Select features
  3. Choose an appropriate method

## EDA Techniques
Just as the goals of EDA may vary, so do the techniques used to accomplish those goals. That said, the EDA process generally involves strategies that fall into the following three categories:
- Data inspection
- Numerical summarization
- Data visualization

### Data Inspection
Data inspection is an important first step of any analysis. This can help illuminate potential issues or avenues for further investigation. For example, we might use the pandas .head() method to print out the first five rows of a dataset:
    print(dataframe.head())

Based on the output of the command, we can identify which columns hold quantitative variables. In order to summarize such a column, we'll need to make sure it is stored as an int or float.

We may also notice that there is at least one instance of missing data, which appears to be stored as nan. As a next step, we could investigate further to determine how much missing data there is and what we want to do about it.
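The follow-up check described above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame; the column names and values are hypothetical and not taken from any dataset in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'score' is stored as text and has one missing value
dataframe = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": ["10", "12", np.nan, "9"],
})

print(dataframe.head())    # first rows (up to five)
print(dataframe.dtypes)    # 'score' is not stored as int or float yet

# Count the missing values in each column
missing_per_column = dataframe.isna().sum()
print(missing_per_column)

# Convert the text column to numbers so it can be summarized
dataframe["score"] = pd.to_numeric(dataframe["score"])
print(dataframe["score"].dtype)
```

Because the column still contains a missing value, the converted column ends up as a float type rather than an int.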

### Numerical Summarization
Once we’ve inspected our data and done some initial cleaning steps, numerical summaries are a great way to condense the information we have into a more reasonable amount of space. For numerical data, this allows us to get a sense of scale, spread, and central tendency. For categorical data, this gives us information about the number of categories and frequencies of each. In pandas, we can get a quick collection of numerical summaries using the .describe() method:
    dataframe.describe(include = 'all')
Based on the table from the command, we can see the number of unique values, the max of an observation, the average, and other aggregates.
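As a small sketch of what the summary table contains, the example below runs .describe() on a made-up DataFrame with one numerical and one categorical column (the data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
dataframe = pd.DataFrame({
    "price": [3.5, 2.0, 4.5, 2.0],
    "shelf": ["top", "mid", "top", "bottom"],
})

summary = dataframe.describe(include="all")
print(summary)

# Numeric rows (mean, std, min, max, ...) are NaN for the text column;
# 'unique', 'top', and 'freq' are NaN for the numeric column.
mean_price = summary.loc["mean", "price"]
n_shelves = summary.loc["unique", "shelf"]
most_common_shelf = summary.loc["top", "shelf"]
print(mean_price, n_shelves, most_common_shelf)
```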

### Data Visualization
While numerical summaries are useful for condensing information, visual summaries can provide even more context and detail in a small amount of space. There are many different types of visualizations that we might want to create as part of EDA. For example, histograms allow us to inspect the distribution of a quantitative feature, providing information about central tendency, spread, and shape (e.g., skew or multimodality).

Other kinds of visualizations are useful for investigating relationships between multiple features. For example, the scatterplot shows the relationship between two variables.
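A minimal sketch of both plot types, assuming matplotlib is installed (these notes do not name a plotting library, so that choice is an assumption). The values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with two quantitative features
dataframe = pd.DataFrame({
    "fiber": [10.0, 2.0, 9.0, 14.0, 1.0, 3.0],
    "rating": [68.4, 34.0, 59.4, 93.7, 34.4, 40.1],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: distribution of a single quantitative feature
ax_hist.hist(dataframe["rating"], bins=5)
ax_hist.set_xlabel("rating")
ax_hist.set_ylabel("count")

# Scatterplot: relationship between two quantitative features
ax_scatter.scatter(dataframe["fiber"], dataframe["rating"])
ax_scatter.set_xlabel("fiber")
ax_scatter.set_ylabel("rating")

fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, call plt.show() afterwards.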

## EDA as a Cyclical Process
Though EDA is commonly performed at the start of a project — before any analysis or model building — you may find yourself revisiting EDA again and again. It is quite common for more questions and problems to emerge during an analysis (or even EDA itself!). EDA is also a great tool for tuning a predictive model to improve its accuracy. It is therefore useful to think of EDA as a cycle rather than a linear process in a data science workflow.

EDA is a crucial step before diving into any data project because it informs data cleaning, can illuminate new research questions, is helpful in choosing appropriate analysis and modeling techniques, and can be useful during model tuning.

## Assessing Variable Types
Variables define datasets. They are the characteristics or attributes that we evaluate during data collection. There are two ways to do that evaluation: we can measure or we can categorize. How we evaluate determines what kind of variable we have. Since there are only two ways to get data, there are only two types of variables: numerical and categorical.

Every observation (the individuals or objects we are collecting data about) is classified according to its characteristics. In “flat” file formats (like tables, CSVs, or DataFrames), the observations are the rows, the variables are the columns, and the values are at the intersection.

Typically, the best way to understand your data is to look at a sample of it. In the example dataset about cereal below, we can look at the first few rows with the .head() method to get an idea of the variable types that we have.
```
print(cereal.head())
    id	    name	    mfr	       type	fiber	rating	shelf	vitamins	coupons	price
0	22341	100% Bran…	Nestle	    C	10.0	68.40	top	25	4	        3.46
1	22791	100% Natur…	Quaker Oats	C	2.0	    33.98	top	0	1	        3.36
2	98141	All-Bran…	Kelloggs	C	9.0	    59.43	top	25	4	        2.07
3	20001	All-Bran w…	Kelloggs	C	14.0	93.70	top	25	3	        3.57
4	67121	Almond Del…	Ralston P..	C	1.0	    34.38	top	25	1	        5.21
```

There are several types of variables. For example:
- The price column describes how much the cereal costs. We don’t know if that’s how much the consumer pays or the grocer pays, but we can be fairly sure that it’s a numerical variable.
- In the mfr column, there are labels like Nestle, Quaker Oats, and Kelloggs, which seem like brands. Since brands are categories, mfr is most likely a categorical variable.
- The id column also has numbers, but we can assume that since it’s the id, it’s not actually representing a value. It’s probably the label for the observation. Since it’s a label, even though it’s a number, id is a categorical variable.

If we were downloading this from a data repository, we would expect a data dictionary to define these variables and validate (or invalidate) our assumptions. It is still important to inspect our dataset because it gives us a better understanding of the data that we are working with and the kinds of operations that will be possible.
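When no data dictionary is at hand, a quick look at data types and unique counts can support these guesses. The sketch below rebuilds three of the cereal columns from the five rows printed above; it is only a heuristic, since uniqueness alone does not prove that a column is a label (price is also unique in this tiny sample).

```python
import pandas as pd

# The five rows shown in the cereal table above (three columns only)
cereal_sample = pd.DataFrame({
    "id": [22341, 22791, 98141, 20001, 67121],
    "mfr": ["Nestle", "Quaker Oats", "Kelloggs", "Kelloggs", "Ralston P.."],
    "price": [3.46, 3.36, 2.07, 3.57, 5.21],
})

print(cereal_sample.dtypes)

# How many distinct values does each column hold?
unique_counts = cereal_sample.nunique()
print(unique_counts)

# An identifier usually has one distinct value per row
id_is_unique = cereal_sample["id"].is_unique
print(id_is_unique)
```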

The following dataset is a [modified version of this Netflix data](https://www.kaggle.com/shivamb/netflix-shows). The est_budget (USD) and cast_count variables were created for illustration purposes.

### Exercise

In [1]:
# Import pandas with alias
import pandas as pd

# Import dataset as a Pandas dataframe
#movies = pd.read_csv("netflix_movies.csv", index_col=0)

# Codecademy cleaned the netflix_movies.csv. Here are the first five rows for the data.
data = {
    "show_id": ["s1", "s2", "s3", "s4", "s5"],
    "type": ["Movie", "TV Show", "TV Show", "TV Show", "TV Show"],
    "title": ["Dick Johnson is Dead", "Blood & Water", "Ganglands", "Jailbirds New Orleans", "Kota Factory"],
    "country": ["United States", "South Africa", None, None, "India"],
    "release_year": [2020, 2021, 2021, 2021, 2021],
    "rating": ["PG-13", "R", "R", "R", "R"], 
    "duration": ["90 min", "2 Seasons", "1 Season", "1 Season", "2 Season"],
    "est_budget": [24879482, 45905454, 81636844, 46693301, 73334474]
}

movies = pd.DataFrame(data)


# View the first five rows of the dataframe
#print(movies.head(5))

# Set the correct value for rating_variable_type
rating_variable_type = "categorical"
#print(rating_variable_type)

## Categorical Variables
We move through the world by categorizing things into various groups: safe/unsafe, best/worst, on/off. These categorizations help us process information. They also create major problems for us when we over-categorize, and can lead to bias and unfair assumptions. However, we still do it, and regardless of the dangers, categorization helps us transform the world around us into data. (Being aware of the dangers is a crucial part of data analysis, but outside the scope of this lesson.) Categorical variables come in 3 types:
1. Nominal variables, which describe something,
2. Ordinal variables, which have an inherent ranking, and
3. Binary variables, which have only two possible variations.

### Nominal Variables
When we want to describe something about the world, we need a nominal variable. Nominal variables are usually words (e.g., red, yellow, blue or hot, cold), but they can also be numbers (e.g., zip codes or user IDs).

Often, nominal variables describe something with a lot of variation. It can be hard to capture all of that variation, so an ‘Other’ category is often necessary. For example, in the case of color, we could have a lot of different labels, but might still need an ‘Other’ category to capture anything we missed.

### Ordinal variables
When our categories have an inherent order, we need an ordinal variable. Ordinal variables are usually described by numbers like 1st, 2nd, 3rd. Places in a race, grades in school, and the scales in survey responses (Likert Scales) are ordinal variables. 

Ordinal variables can be a little tricky because even though they are numbers, it doesn’t make sense to do math on them. For example, let’s say an Olympian won a Gold medal (1st place) and a Bronze medal (3rd place). We wouldn’t say that they averaged Silver medals (2nd place).

Though there is [some debate about whether Likert scales should be treated like intervals or ordinal categories](https://en.wikipedia.org/wiki/Likert_scale), they are commonly treated as ordinal categories, which means summaries that rely on equal spacing between values (such as the mean) should be used with caution.

### Binary Variables
When there are only two logically possible variations, we need a binary variable. Binary variables are things like on/off, yes/no, and TRUE/FALSE. If there is any possibility of a third option, it is not a binary variable.

Let’s take a look at our cereal dataset.
```
print(cereal.head())
    id	    name	    mfr	       type	fiber	rating	shelf	vitamins	coupons	price
0	22341	100% Bran…	Nestle	    C	10.0	68.40	top	25	4	        3.46
1	22791	100% Natur…	Quaker Oats	C	2.0	    33.98	top	0	1	        3.36
2	98141	All-Bran…	Kelloggs	C	9.0	    59.43	top	25	4	        2.07
3	20001	All-Bran w…	Kelloggs	C	14.0	93.70	top	25	3	        3.57
4	67121	Almond Del…	Ralston P..	C	1.0	    34.38	top	25	1	        5.21
```

There are some obvious categorical variables: The name of the product, the mfr (manufacturer), and the shelf are all nominal categorical variables. We know this because they are written in descriptive words or letters.

A little less obvious is the type field. They are all ‘C’, which could be a ranking (A, B, C, and therefore an ordinal variable) or it could be a description and therefore a nominal variable. We would have to return to the data dictionary to find out for certain.

The id field may also cause confusion. It’s a number, but it’s not a count or a measurement. Rather, ‘id’ is a categorical variable since it is describing each observation in the same way that the name is.

### Exercise

In [2]:
# View the first five rows of the dataframe
#print(movies.head())

# Print the unique values in the country column
#print(movies.country.unique())

# Set the correct value for country_variable_type
country_variable_type = "nominal"

## Quantitative Variables
Numerical variables are created in two ways: through measurement and through counting. While measurement is a [matter of philosophical debate](https://plato.stanford.edu/entries/measurement-science/), counting is pretty straightforward. The result is continuous and discrete variables.

Continuous variables come from measurements. For a variable to be continuous, it must always be possible to measure in smaller units between one unit and the next. Continuous variables can be represented with decimal places (but because of rounding, sometimes they appear as whole numbers). Length, time, and temperature are all good examples of continuous variables because they all increase continuously.

Discrete variables come from counting. For a variable to be discrete, there must be gaps between the smallest possible units. People, cars, and dogs are all good examples of discrete variables.

Some variables depend on context to determine if they are continuous or discrete. Money and time can both be measured continuously or discretely.

For money, all currencies have a smallest-possible-unit (e.g., the cent in USD) and are therefore discrete. However, banks and other institutions sometimes measure money in fractions of a cent, treating it like a continuous variable.

It is therefore always essential to understand how your data was created in order to represent it appropriately.

Let’s take a look at the cereal dataset again.
```
    id	    name	    mfr	       type	fiber	rating	shelf	vitamins	coupons	price
0	22341	100% Bran…	Nestle	    C	10.0	68.40	top	25	4	        3.46
1	22791	100% Natur…	Quaker Oats	C	2.0	    33.98	top	0	1	        3.36
2	98141	All-Bran…	Kelloggs	C	9.0	    59.43	top	25	4	        2.07
3	20001	All-Bran w…	Kelloggs	C	14.0	93.70	top	25	3	        3.57
4	67121	Almond Del…	Ralston P..	C	1.0	    34.38	top	25	1	        5.21
```

There are five numerical variables: fiber, rating, vitamins, coupons, and price. Without looking at the data dictionary, we can make some guesses about what kind of numerical variables they are:

Fiber, rating, and price all have decimal places. That’s our first clue that they might be continuous. Based on our limited knowledge, we might guess that fiber and rating are both continuous measurements that could have more decimal places, and price is discrete because there’s nothing smaller than a cent.

Vitamins and coupons do not have decimal places. Vitamins and coupons both seem like good candidates to be counts and therefore discrete. The answers to “how many vitamins” and “how many coupons” would both be whole numbers. (We already said that id is categorical in the last exercise.)

We would be more confident in our answers if we were able to inspect the documentation. But sometimes documentation isn’t available and you have to take your best guess.
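The "decimal places" clue can be checked in code. The helper below, has_fractional_values, is a hypothetical name introduced for this sketch; it uses the values from the five cereal rows shown above and only offers a hint, not a verdict.

```python
import pandas as pd

# Values copied from the five cereal rows shown above
cereal_sample = pd.DataFrame({
    "fiber": [10.0, 2.0, 9.0, 14.0, 1.0],
    "price": [3.46, 3.36, 2.07, 3.57, 5.21],
    "coupons": [4, 1, 4, 3, 1],
})

def has_fractional_values(column: pd.Series) -> bool:
    """Return True if any value in the column has a non-zero fractional part."""
    return bool(((column % 1) != 0).any())

fractional = {name: has_fractional_values(cereal_sample[name]) for name in cereal_sample.columns}
print(fractional)
```

In this sample, fiber is stored as a float but contains only whole numbers, which is why the stored data type alone cannot settle whether a variable is continuous or discrete.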

### Exercise

In [3]:
# View the first five rows of the dataframe
#print(movies.head())

# Set the correct value for release_year_variable_type
release_year_variable_type = "discrete" # Year is the smallest unit of time in the example
#print(release_year_variable_type)

# Set the correct value for cast_count_variable_type
cast_count_variable_type = "discrete" # You cannot have half a person
#print(cast_count_variable_type)

## Changing Numerical Variable Data Types
When you read a data file (such as a csv) with pandas, data types are assigned to each column. Pandas does its best to predict what kind of data type each variable should contain. For example, if a column contains only integer values, it will be stored as an int32 or int64. This usually works, but problems can arise for our analysis later on when there’s a mismatch between the real-world variable type and the data type pandas assigns.

With numerical variables, pandas expects any column that has decimal values to be a float and anything without decimal values to be an integer. If any non-numeric characters appear in the column, pandas will treat it as an object.

It’s possible to determine the data types of the columns in your DataFrame with the .dtypes attribute.

For example, in our cereal dataset, Pandas returned the following list:
    print(cereal.dtypes)
id	     int64
name	 object
mfr	     object
type	 object
fiber	 float64
rating	 float64
shelf	 object
vitamins int64
coupons  int64
price	 float64
dtype: object

Best practices for data storage say that we should match the data type of the column with its real-world variable type. Therefore:
- Continuous (numerical) variables should usually be stored as the float data type because they allow us to store decimal values.
- Discrete (numerical) variables should be stored as the int datatype to represent mathematically that they are discrete.
(note that the difference between int32/int64 and float32/float64 does not concern us here – it is an issue for much larger numbers)

Using float and int to store quantitative variables is important so that you can later perform numerical operations on those values. It also helps indicate what the variables refer to in the real world. Keeping them separate helps ensure that we perform the right calculations and get the right results.

If a variable appears with the wrong data type, we can change it with the .astype() method.
    cereal['id'] = cereal['id'].astype("string")
    print(cereal.dtypes)

The .astype() method can be used to convert between data types, including:
- int32 int64
- float32 float64
- object
- string
- bool

However, some data types require all values to be filled in. For example, you cannot convert a float column to int64 if it contains any null values.
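The sketch below illustrates that limitation on a made-up Series (the values are hypothetical) and shows two ways around it. Filling with 0 changes what the data says, so whether it is appropriate depends on the analysis; the nullable Int64 type is an alternative pandas offers that these notes do not otherwise cover.

```python
import numpy as np
import pandas as pd

# Hypothetical values; one entry is missing
counts = pd.Series([3.0, np.nan, 7.0], name="cast_count")

# A direct conversion fails because NaN has no int64 representation
try:
    counts.astype("int64")
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)

# Option 1: fill the gaps first, then convert
filled = counts.fillna(0).astype("int64")
print(filled)

# Option 2: the nullable integer type keeps the missing value as <NA>
nullable = counts.astype("Int64")
print(nullable)
```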

### Exercise - Clean

In [4]:
# View the first five rows of the dataframe
#print(movies.head())

# Print the data types
print(movies.dtypes)

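# Try to change the cast_count variable to an integer of type int64
# We should expect an error because there are NA values!
#movies["cast_count"] = movies["cast_count"].astype("int64")
# Comment the above code and move it to the bottom!

# Note: the five-row sample built in the first exercise has no cast_count column,
# so the lines below raise KeyError unless the full netflix_movies.csv is loaded.

# Fill in the missing cast_count values with 0
# (assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas)
movies["cast_count"] = movies["cast_count"].fillna(0)

# Change the type of the cast_count column
movies["cast_count"] = movies["cast_count"].astype("int64")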

# Check the data types of the columns again. 
#print(movies.dtypes)

show_id         object
type            object
title           object
country         object
release_year     int64
rating          object
duration        object
est_budget       int64
dtype: object


KeyError: 'cast_count'

## Changing Categorical Variable Data Types
Now let’s focus on Categorical variables and make sure they are in the correct format. Let’s take another look at the cereal dataset to assess the data types of our categorical variables.
     print(cereal.dtypes)
id	     int64
name	 object
mfr	     object
type	 object
fiber	 float64
rating	 float64
shelf	 object
vitamins int64
coupons  int64
price	 float64
dtype: object

Just like with numerical variables, best practices for categorical data storage say that we should match the data type of the column with its real-world variable type. However, the types are a little more nuanced:
- Nominal variables are often represented by the object data type. Columns with the object data type can contain any combination of values, including strings, integers, booleans, etc. Because the values are not guaranteed to be strings, string operations like .str.lower() can return missing values or fail on object columns.
- Nominal variables can also be represented by the string data type. However, pandas usually guesses object rather than string, so if you want a column to be a string, you will likely have to explicitly tell pandas to make it one. This is most important if you want to do string manipulations on a column, like .str.lower().
- Ordinal variables should be represented as objects, but pandas often guesses int since they are often encoded as whole numbers.
- Binary variables can be represented as bool, but pandas often guesses int or object data types.
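To make the difference between object and string columns concrete, here is a small sketch on made-up values (both Series are hypothetical):

```python
import pandas as pd

# An object column can mix strings with other values
mixed = pd.Series(["Nestle", 42, "Kelloggs"], dtype="object")
lowered_mixed = mixed.str.lower()   # the non-string entry becomes NaN
print(lowered_mixed)

# A string column holds text only, so string methods apply to every value
names = pd.Series(["Nestle", "Quaker Oats"]).astype("string")
lowered_names = names.str.lower()
print(lowered_names)
```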

We have a lot to change in our cereal dataset, so let’s go through them one by one. We already learned about the .astype() method, which can be used to convert into the following categorical data types:
- object
- string
- bool

1. id should be an object since it’s a nominal variable that is not a string.
2. name and mfr should be strings since they are words and we may want to lowercase, uppercase, or otherwise transform them with string methods.
3. shelf and type can stay as objects since they are codes (though it would be just as valid to make them into strings).
```
    cereal['id'] = cereal['id'].astype("object")
    cereal['name'] = cereal['name'].astype("string")
    cereal['mfr'] = cereal['mfr'].astype("string")
```
id	    object
name	string
mfr	    string
type	object
fiber	float64
rating	float64
shelf	object
vitamins int64
coupons	int64
price	float64
dtype: object
    
Now it’s time for you to try it on the Netflix data. Be sure to take into account how the data is recorded and what you might want to do with each variable.

### Exercise

In [None]:
# View the first five rows of the dataframe
#print(movies.head())

# Print the data types of dataframe 
#print(movies.dtypes)

# Add the variables you plan to change to this list
change = ['title', 'rating']

# Change the title variable to a "string"
movies['title'] = movies['title'].astype('string') 

# Change any other variables
movies['rating'] = movies['rating'].astype("string")

# Print the data types again
#print(movies.dtypes)

## The Pandas Category Data Type
For ordinal categorical variables, we often want to store two different pieces of information: category labels and their order. None of the data types we’ve covered so far can store both of these at once. For example, let’s take another look at the shelf variable in our cereal DataFrame, which contains the shelf each item is on stored as strings. We can use the .unique() method to inspect the category names:
```python
print(cereal['shelf'].unique())
# Output
# [top, mid, bottom]
```

At this point, pandas does not know that these categories have an inherent order. Luckily, there is a specific data type for categorical variables in pandas called category to address this problem! The pd.Categorical() constructor can be used to store data as type category and indicate the order of the categories.
```python
cereal['shelf'] = pd.Categorical(cereal['shelf'], ['bottom', 'mid', 'top'], ordered=True)
print(cereal['shelf'].unique())
# Output
# [top, mid, bottom]
# Categories (3, object): [bottom < mid < top]
```

Now, not only does pandas recognize that the shelf column is an ordinal variable, it understands that top > mid > bottom. If we call .unique() on this column again, the Categories line shows that pandas retains the correct ranking.

This is helpful in the event that we would like to sort the column by category; if we use .sort_values(), the DataFrame will be sorted by the logical order of the shelf column as opposed to the alphabetical order.
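The shelf labels happen to sort the same way alphabetically as logically, so the sketch below uses the small/medium/large labels from the review exercise to show the difference. The rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows using the engine_size labels from the review exercise
auto = pd.DataFrame({"engine_size": ["medium", "large", "small", "medium"]})

# As plain strings, sorting is alphabetical
alphabetical = auto.sort_values("engine_size")["engine_size"].tolist()
print(alphabetical)

# As an ordered category, sorting follows the declared order
auto["engine_size"] = pd.Categorical(
    auto["engine_size"], categories=["small", "medium", "large"], ordered=True
)
logical = auto.sort_values("engine_size")["engine_size"].tolist()
print(logical)

# Ordered categories also support comparisons and integer codes
above_small = (auto["engine_size"] > "small").tolist()
codes = auto["engine_size"].cat.codes.tolist()
print(above_small, codes)
```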


### Exercise

In [None]:
# Import dataset as a Pandas Dataframe
movies = pd.read_csv('netflix_movies.csv')

# View the first five rows of the dataframe
#print(movies.head())

# Print the unique values of the rating column
print(movies['rating'].unique())

# Change the data type of `rating` to category
movies["rating"] = pd.Categorical(movies["rating"], ["NR", "G", "PG", "PG-13", "R"], ordered = True)

# Recheck the values of `rating` with .unique()
print("")
print(movies.rating.unique())

## One-Hot Encoding
In the previous exercise, we saw how label encoding can be useful for ordinal categorical variables. But sometimes we need a different approach. This could be because:
- We have a nominal categorical variable (like breed of dog), so it doesn’t really make sense to assign numbers like 0, 1, 2, 3, 4, 5 to our categories, as this could create an order among the breeds that is not present.
- We have an ordinal categorical variable but we don’t want to assume that there’s equal spacing between categories.

Another way of encoding categorical variables is called One-Hot Encoding (OHE). With OHE, we essentially create a new binary variable for each of the categories within our original variable. This technique is useful when managing nominal variables because it encodes the variable without creating an order among the categories.

Let’s take a look at the titanic dataframe.

    Survived	Pclass	Name	                                      SibSp Parch	Fare	Cabin	Embarked
0	0	        3	    Braund, Mr. Owen Harris	                        1	0	    7.2500	NaN	    S
1	1	        1	    Cumings, Mrs. John Bradley (Florence Briggs Th…	1	0	    71.2833	C85	    C
2	1	        3	    Heikkinen, Miss. Laina	                        0	0	    7.9250	NaN	    S
3	1	        1	    Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	0	    53.1000	C123	S
4	0	        3	    Allen, Mr. William Henry                        0	0	    8.0500	NaN	    S

To perform OHE on a variable within a pandas dataframe, we can use the pandas get_dummies() function, which creates a binary or “dummy” variable for each category. We can assign the columns to be encoded in the columns parameter, and set the data parameter to the dataset we intend to alter. The pd.get_dummies() function will also work on data types other than category.

Notice that when using pd.get_dummies(), we are effectively creating a new dataframe that contains a different set of variables to the original dataframe.

```
titanic = pd.get_dummies(data=titanic, columns=['Embarked'])
print(titanic.head())
```

        Survived  Pclass  Name                                             SibSp  Parch  Fare     Cabin  Embarked_C  Embarked_Q  Embarked_S
    1   1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  1      0      71.2833  C85    1           0           0
    3   1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     1      0      53.1000  C123   0           0           1
    6   0         1       McCarthy, Mr. Timothy J                          0      0      51.8625  E46    0           0           1
    10  1         3       Sandstrom, Miss. Marguerite Rut                  1      1      16.7000  G6     0           0           1
    11  1         1       Bonnell, Miss. Elizabeth                         0      0      26.5500  C103   0           0           1

By passing the dataset and the column that we want to encode into pd.get_dummies(), we have created a new dataframe that contains three new binary variables with values of 1 for True and 0 for False (pandas 2.0 and later show True and False instead unless dtype=int is passed), which we can view when we scroll to the right in the table. The nominal variable is now encoded without any implied order or weighting among its categories. It is important to note that OHE works best when we do not create too many additional variables, as increasing the dimensionality of our dataframe can create problems when working with certain machine learning models.
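A self-contained sketch of the same call, using the Fare and Embarked values from the first three titanic rows shown earlier. Passing dtype=int is a choice made here so that the output shows 1 and 0 regardless of the pandas version.

```python
import pandas as pd

# Fare and Embarked values from the first three titanic rows shown earlier
passengers = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# One new 0/1 column per category; the original Embarked column is dropped
encoded = pd.get_dummies(data=passengers, columns=["Embarked"], dtype=int)
print(encoded)
```

Only Embarked_C and Embarked_S are created here because Q does not appear in this three-row sample; get_dummies() makes columns only for categories that are present in the data.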

In [None]:
# Import dataset as a Pandas Dataframe
#cereal = pd.read_csv('cereal.csv', index_col=0)

# Show the first five rows of the `cereal` dataframe
#print(cereal.head())

# Create a new dataframe with the `mfr` variable One-Hot Encoded
#cereal = pd.get_dummies(data = cereal, columns = ["mfr"])

# Show first five rows of new dataframe
#print(cereal.head())

## Variable Types Review
You’ve done a fantastic job! In this lesson, you have:
- Discovered the different types of variables you will encounter when working with data and their corresponding data types in Python.
- Explored datasets with .head().
- Assessed categories within variables with the .unique() method.
- Practiced ways to check the data type of variables like the .dtypes attribute.
- Altered data with the .fillna() method.
- Learned how to change the data types of variables using the .astype() method.
- Investigated the pandas category data type.
- Developed your One-Hot Encoding skills with the pd.get_dummies() method.

In this lesson, we used a cereal dataset from [Kaggle](https://www.kaggle.com/crawford/80-cereals), which was originally created by Chris Crawford and which contains data on various cereal brands in the US. We made alterations to this data for the purposes of the lesson. The other datasets used in this lesson can be found here:
- The [movies](https://www.kaggle.com/shivamb/netflix-shows) dataset courtesy of Shivam Bansal via Kaggle.
- The [auto](https://archive.ics.uci.edu/ml/datasets/Automobile) dataset courtesy of UCI Machine Learning Repository.
- The [titanic](https://www.kaggle.com/heptapod/titanic) dataset courtesy of Khashayar Baghizadeh Hosseini via Kaggle.
- The [clothes](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) dataset courtesy of Nicapotato via Kaggle.

Let’s practice the skills you just learned. Because this is review, we won’t check your work on these tasks. If you get an error, take a look at the hints or go back to that exercise in this lesson and review how to do it.

### Exercise Review

In [None]:
# Import pandas with alias
import pandas as pd

# Import dataset as a Pandas Dataframe
#auto = pd.read_csv('autos.csv', index_col=0)

# Print the first 10 rows of the auto dataset
#print(auto.head(10))

# Print the data types of the auto dataframe
#print(auto.dtypes)

# Change the data type of price from int to float with .astype() method
#auto["price"] = auto["price"].astype("float")

# Convert the engine_size variable to the category data type with an order of ['small', 'medium', 'large'], and check the order with the .unique() method.
#auto["engine_size"] = pd.Categorical(auto["engine_size"], ["small", "medium", "large"], ordered = True)
#print(auto.engine_size.unique())

# Create a new variable called engine_codes which contains the numerical codes associated with each category in the engine_size variable with the .cat.codes accessor. Check the new values with the .head() method.
#auto["engine_codes"] = auto["engine_size"].cat.codes
#print(auto.head())

# One-Hot Encode the body-style category in the auto dataframe. Then check the dataframe with .head().
#auto = pd.get_dummies(data = auto, columns = ["body-style"])
#print(auto.head())

### Exercise Census Variables

In [None]:
# Import pandas with alias
#import pandas as pd

# Read in the census dataframe
#census = pd.read_csv('census_data.csv', index_col=0)

# Task 1: The census dataframe is composed of simulated census data to represent demographics of a small community in the U.S. Call the .head() method on the census dataframe and print the output to view the first five rows.
#print(census.head())

# Task 2: Review the dataframe description and values returned by .head() to assess the variable types of each of the variables. This is an important step to understand what preprocessing will be necessary to work with the data.

# Task 3: Compare the values returned from the .head() method with the data types of each variable by calling .dtypes on the census dataframe and print the result.
#print(census.dtypes)

# Task 4: The manager of the census would like to know the average birth year of the respondents. We were able to see from .dtypes that birth_year has been assigned the str datatype whereas it should be expressed in int. Print the unique values of the variable using the .unique() method.
#print(census.birth_year.unique())

# Task 5: There appears to be a missing value in the birth_year column. With some research you find that the respondent’s birth year is 1967. Use the .replace() method to replace the missing value with 1967, so that the data type can be changed to int. Then recheck the values in birth_year by calling the .unique() method and printing the results.
#census["birth_year"] = census["birth_year"].replace(["missing"], 1967)

#print(census["birth_year"].unique())

# Task 6: Now that we have adjusted the values in the birth_year variable, change the datatype from str to int and print the datatypes of the census dataframe with .dtypes.
#print(census.dtypes)
#census["birth_year"] = census["birth_year"].astype("int")
#print(census.dtypes)

# Task 7: Having assigned birth_year to the appropriate data type, print the average birth year of the respondents to the census using the pandas .mean() method.
#avg_birth_year = census.birth_year.mean()
#print("The average birth year is " + str(avg_birth_year))

# Task 8: Your manager would like to set an order to the higher_tax variable so that: strongly disagree < disagree < neutral < agree < strongly agree. Convert the higher_tax variable to the category data type with the appropriate order, then print the new order using the .unique() method.
#census["higher_tax"] = pd.Categorical(census["higher_tax"], ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"], ordered = True)
#print(census.higher_tax.unique())

# Task 9: Your manager would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. Label encode the higher_tax variable and print the median using the pandas .median() method.
#census["higher_tax"] = census["higher_tax"].cat.codes
#print(census.higher_tax.median())

# Task 10: Your manager is interested in using machine learning models on the census data in the future. To help, let’s One-Hot Encode marital_status to create binary variables of each category. Use the pandas get_dummies() method to One-Hot Encode the marital_status variable. Print the first five rows of the new dataframe with the .head() method. Note that you’ll have to scroll to the right or expand the web-browser to see the dummy variables.
#census = pd.get_dummies(data = census, columns = ["marital_status"])
#print(census.head())

# Task 11: Create a new variable called marital_codes by Label Encoding the marital_status variable. This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their marital status.
#census = pd.read_csv('census_data.csv', index_col=0)
#census["marital_status"] = pd.Categorical(census["marital_status"])
#census["marital_codes"] = census["marital_status"].cat.codes
#print(census.head())

# Task 12: Create a new variable called age_group, which groups respondents based on their birth year. The groups should be in five-year increments, e.g., 25-30, 31-35, etc. Then label encode the age_group variable to assist the Census team in the event they would like to use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.
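One possible approach to Task 12 is sketched below. It is not the course's reference answer. The toy dataframe `census_demo`, the `reference_year` used to turn birth years into ages, and the bin edges and labels are all assumptions made for illustration. Ages that fall outside the chosen bins become missing values and receive the code -1.

```python
import pandas as pd

# Hypothetical toy data standing in for the census dataframe
census_demo = pd.DataFrame({"birth_year": [1967, 1990, 1985, 1994, 1979]})

# Assumption: ages are measured relative to a fixed reference year
reference_year = 2020
census_demo["age"] = reference_year - census_demo["birth_year"]

# Five-year bins with right-closed edges: (25, 30], (30, 35], ..., (50, 55]
bin_edges = list(range(25, 60, 5))
bin_labels = [f"{lo + 1}-{hi}" for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
census_demo["age_group"] = pd.cut(census_demo["age"], bins=bin_edges, labels=bin_labels)

# pd.cut returns an ordered categorical, so the codes follow the bin order
census_demo["age_group_codes"] = census_demo["age_group"].cat.codes
print(census_demo)
```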

-----

# EDA: Inspect, Clean, and Validate a Dataset
One of the most challenging parts of data cleaning is diagnosing data issues and figuring out HOW to most effectively address them. In order to accomplish this, exploratory data analysis (EDA) can be an extremely useful tool. In this article, we’ll walk through an example dataset to demonstrate how EDA can inform the initial data inspection, cleaning, and validation process.

While this article serves as an introduction to EDA for data cleaning, it is important to note that every dataset is different, and therefore will require different exploration. EDA is all about following the data, verifying your assumptions, and investigating anything that is unexpected.

## Initial Data Inspection
Before analysis or cleaning, it is useful to print a few rows of data. This helps ensure that the data is properly loaded. It also allows us to compare the observed data to the data dictionary and determine whether the coding appears to match our expectations. For example, let’s load and inspect the first few rows of a dataset of heart disease patients (downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/heart+disease)).

```
import pandas as pd
heart = pd.read_csv('processed.cleveland.data.csv')
print(heart.head())

```
There are a few things we might want to inspect. For example, the data dictionary gives the following information about the cp column:
cp: chest pain type
- Value 1: typical angina
- Value 2: atypical angina
- Value 3: non-anginal pain
- Value 4: asymptomatic

Based on this information, it’s not necessarily clear whether the data is going to be coded as numerical values (e.g., 1, 2, 3, or 4) or with strings (e.g., 'typical angina'). Data inspection allows us to clarify that this column contains numerical values.

Similarly, there is some conflicting information in the data dictionary about the target column (note: we renamed this column as heart_disease before loading it, but it was originally coded as num). The list of features contains the following information about this column:

num: diagnosis of heart disease (angiographic disease status)
- Value 0: < 50% diameter narrowing
- Value 1: > 50% diameter narrowing

However, the initial data description suggests that the target field is integer valued from 0-4, where 0 indicates no heart disease, and values 1-4 indicate the presence of heart disease.

By inspecting the first few rows of data, we see at least one instance of the value 2 in the heart_disease column. This suggests that the values probably range from 0-4 instead of just 0-1. We could verify this with further exploration (e.g., by using `heart.heart_disease.value_counts()` to get a table of values in this column).

## Data Information
Once we’ve taken a first look at some data, a common next step is to address questions such as:
- How many (non-null) observations do we have?
- How many unique columns/features do we have?
- Which columns (if any) contain missing data?
- What is the data type of each column?

Using pandas, we can easily address these questions using the .info() method. For example:
```
heart.info()  # .info() prints its report directly and returns None

# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             303 non-null    object 
 12  thal           303 non-null    object 
 13  heart_disease  303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
```
There are a few interesting pieces of information that we can glean from this output:
- There are 303 rows and 14 columns of data
- At first glance, there are no null (i.e., missing) values in any column (we’ll come back to this)
- The ca and thal columns have a data type of object (which suggests that they are strings), even though we saw in our initial inspection that these columns appear to contain numerical values

To investigate the unexpected output here, we might want to take a look at the unique values in the ca column:

```
print(heart.ca.unique())

# Output
array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)
```
We note that at least one row contains a '?' in this column. We can probably assume that this indicates mis-coded missing data. The '?' also probably forced the column to be coded as a string because there is no obvious way to cast a '?' to a numerical value.

Given this information, we now have more to do! We can replace any instance of '?' with np.nan, change the data type of this column back to a float or integer, and then re-run heart.info() to determine how many missing values we’ve got. Then, we probably want to do a similar inspection of the thal column.
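The cleanup steps described above might look something like the following sketch. The `heart_demo` dataframe is a hypothetical stand-in that only mimics the mis-coded `ca` values, so the counts it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column that mimics the mis-coded 'ca' values
heart_demo = pd.DataFrame({"ca": ["0.0", "3.0", "?", "1.0"]})

# Swap the '?' placeholder for a real missing value, then cast to float
heart_demo["ca"] = heart_demo["ca"].replace("?", np.nan).astype(float)

# Confirm the new data type and count the missing values
print(heart_demo["ca"].dtype)
print(heart_demo["ca"].isnull().sum())
```

A float column is used here because an integer column in pandas cannot hold `np.nan` without switching to a nullable integer type.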

## Inspecting Missing Data
After identifying that there is some missing data and converting it to a format that Python can recognize, it’s often a good idea to take a closer look at those rows. Sometimes, we can find clues as to WHY the data is missing, which can help us make decisions about whether to get rid of the rows altogether or impute the missing values somehow.
```
heart[heart.isnull().any(axis=1)]
```
Looking at this output, we note that there is no overlap between the rows with missing ca data and missing thal data. This suggests that these patients are missing ca and thal information for different reasons. We don’t see any immediate clues as to why the data is missing in the first place, but we can inspect this further once we start digging into individual features.

## Data Exploration in Real-Time
If you’d like to watch us inspect this dataset in real time, feel free to check out the [livestream recording](https://youtu.be/YwadRm2sfpQ).

If you’d like to play with the data yourself, you can download the code and data from our [Github repository](https://github.com/Codecademy/Master-Statistics-Live-Series).







----

# Exploratory Data Analysis: Summary Statistics

Summary statistics are an important component of Exploratory Data Analysis (EDA) because they allow a data analyst to condense a large amount of information into a small set of numbers that can be easily interpreted. In order to decide what kind of summary statistic to use, it is important to consider two things:

- The question (and how many variables that question involves)
- The data (is it quantitative or categorical?)

## Univariate Statistics
Summary statistics that focus on a single variable are called univariate statistics. They are useful for answering questions about a single feature in tabular data. For example, the following dataset contains information about used cars listed on cardekho.com:
```
name	year	selling_price	km_driven	fuel	transmission	owner	mileage	engine
0	Maruti Swift Dzire VDI	2014	450000	145500	Diesel	Manual	First Owner	23.4 kmpl	1248 CC
1	Skoda Rapid 1.5 TDI Ambition	2014	370000	120000	Diesel	Manual	Second Owner	21.14 kmpl	1498 CC
2	Honda City 2017-2020 EXi	2006	158000	140000	Petrol	Manual	Third Owner	17.7 kmpl	1497 CC
3	Hyundai i20 Sportz Diesel	2010	225000	127000	Diesel	Manual	First Owner	23.0 kmpl	1396 CC
4	Maruti Swift VXI BSIII	2007	130000	120000	Petrol	Manual	First Owner	16.1 kmpl	1298 CC
```

Univariate statistics can help us answer questions like:

- How much does a typical car cost?
- What proportion of cars have a manual transmission?
- How old is the oldest listed car?

Each of these questions focuses on a single variable (`selling_price`, `transmission`, and `year`, respectively, for the above examples). Depending on the type of variable, different summary statistics are appropriate.

### Quantitative Variables
When summarizing quantitative variables, we often want to describe central location and spread.

#### Central Location
The central location (also called central tendency) is often used to communicate the “typical” value of a variable. Recall that there are a few different ways of calculating the central location:
- Mean: Also called the “average”; calculated as the sum of all values divided by the number of values.
- Median: The middle value of the variable when sorted.
- Mode: The most frequent value in the variable.
- Trimmed Mean: The mean excluding x percent of the lowest and highest data points.

Choosing an appropriate summary statistic for central tendency sometimes requires data visualization techniques along with domain knowledge. For example, suppose we want to know the typical price of a car in our dataset. If we calculate each of the statistics described above, we’ll get the following estimates:
- Mean = Rs. 638271.8
- Median = Rs. 450000.00
- Mode = Rs. 300000.00
- Trimmed Mean = Rs. 473336.1

Because the mean is so much larger than the median and trimmed mean, we might guess that there are some outliers in this data with respect to price. We can investigate this by plotting a histogram of `selling_price`:

```
# Generate a histogram of the selling_price variable
import matplotlib.pyplot as plt

plt.hist(cars['selling_price'])
plt.show()
```

Indeed, we see that `selling_price` is highly right-skewed. The very high prices (10 million Rupees for a small number of cars) are skewing the average upwards. By using the median or a trimmed mean, we can more accurately represent a “typical” price.

#### Spread

Spread, or dispersion, describes the variability within a feature. This is important because it provides context for measures of central location. For example, if there is a lot of variability in car prices, we can be less certain that any particular car will be close to 450000.00 Rupees (the median price). Like the central location measures, there are a few values that can describe the spread:
- Range: The difference between the maximum and minimum values in a variable.
- Inter-Quartile Range (IQR): The difference between the 75th and 25th percentile values.
- Variance: The average of the squared distance from each data point to the mean.
- Standard Deviation (SD): The square root of the variance.
- Mean Absolute Deviation (MAD): The mean absolute value of the distance between each data point and the mean.

Choosing the most appropriate measure of spread is much like choosing a measure of central tendency, in that we need to consider the data holistically. For example, below are measures of spread calculated for `selling_price`:
- Range: Rs. 9970001
- IQR: Rs. 420001
- Variance: 650044550668.61 (Rs^2)
- Standard Deviation: Rs. 806253.40
- Mean Absolute Deviation: Rs. 422131.4

We see that the range is almost 10 million Rupees; however, this could be due to a single 10 million Rupee car in the dataset. If we remove that one car, the range might be much smaller. The IQR is useful in comparison because it trims away outliers.

Meanwhile, we see that variance is extremely large. This happens because variance is calculated using squared differences, and is therefore not in the same units as the original data, making it less interpretable. Both the standard deviation and MAD solve this issue, but MAD is even less impacted by extreme outliers.

For highly skewed data or data with extreme outliers, we therefore might prefer to use IQR or MAD. **For data that is more normally distributed, the variance and standard deviation are frequently reported**.
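To see how differently these measures react to a single extreme value, here is a small sketch. The numbers are made up for illustration and are not from the car dataset, and `spread_summary` is a hypothetical helper rather than a library function.

```python
import pandas as pd

# Hypothetical toy prices (not taken from the car dataset)
prices = pd.Series([100, 120, 130, 150, 160, 170, 200])
prices_with_outlier = pd.concat([prices, pd.Series([10000])], ignore_index=True)

def spread_summary(values):
    """Return range, IQR, standard deviation, and mean absolute deviation."""
    return pd.Series({
        "range": values.max() - values.min(),
        "iqr": values.quantile(0.75) - values.quantile(0.25),
        "std": values.std(),
        "mad": (values - values.mean()).abs().mean(),
    })

# Put the two summaries side by side for comparison
comparison = pd.DataFrame({
    "without_outlier": spread_summary(prices),
    "with_outlier": spread_summary(prices_with_outlier),
})
print(comparison)
```

In this toy example the range jumps from 100 to 9900 when the outlier is added, while the IQR only moves from 40 to 50.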

### Categorical Variables

Categorical variables can be either ordinal (ordered) or nominal (unordered). For ordinal categorical variables, we may still want to summarize central location and spread. However, because ordinal categories are not necessarily evenly spaced (like numbers), we should NOT calculate the mean of an ordinal categorical variable (or anything that relies on the mean, like variance, standard deviation, and MAD).
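As a rough illustration, the median of an ordinal variable can be found from its category codes, because the median relies only on the ordering and not on the spacing between categories. The `responses` data and the `levels` list below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses on an ordered scale
levels = ["low", "medium", "high"]
responses = pd.Series(pd.Categorical(
    ["low", "high", "medium", "medium", "high"],
    categories=levels,
    ordered=True,
))

# The codes preserve order only, so use them to find the middle category.
# With an even number of responses the median code may fall between two
# categories; int() then rounds down to the lower one.
median_code = int(np.median(responses.cat.codes))
median_level = levels[median_code]
print(median_level)
```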

For nominal categorical variables (and ordinal categorical variables), another common numerical summary statistic is the frequency or proportion of observations in each category. This is often reported using a frequency table and can be visualized using a bar plot.

For example, suppose we want to know what kind of fuel listed cars tend to use. We could calculate the frequency of each fuel type:

```
cars.fuel.value_counts()

# Output:
Diesel      2153
Petrol      2123
CNG           40
LPG           23
Electric       1
Name: fuel, dtype: int64
```

This tells us that `'Diesel'` cars are most common, with `'Petrol'` cars a close second. Converting these frequencies to proportions can also help us compare fuel types more easily. For example, the following table of proportions indicates that `'Diesel'` cars account for almost half of all listings.

```
cars.fuel.value_counts(normalize=True)

# Output
Diesel      0.496083
Petrol      0.489171
CNG         0.009217
LPG         0.005300
Electric    0.000230
Name: fuel, dtype: float64
```

## Bivariate Statistics

In contrast to univariate statistics, bivariate statistics are used to summarize the relationship between two variables. They are useful for answering questions like:
- Do manual transmission cars tend to cost more or less than automatic transmission?
- Do older cars tend to cost less money?
- Are automatic transmission cars more likely to be sold by individuals or dealers?

Depending on the types of variables we want to summarize a relationship between, we should choose different summary statistics.

#### One Quantitative Variable and One Categorical Variable

If we want to know whether manual transmission cars tend to cost more or less than automatic transmission cars, we are interested in the relationship between `transmission` (categorical) and `selling_price` (quantitative). To answer this question, we can use a mean or median difference.

For example, we could calculate that the median price of automatic transmission cars is 100000 Rupees higher than for manual transmission cars.
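A median difference of this kind could be computed along the following lines. The `cars_demo` dataframe holds made-up listings, so the resulting difference is illustrative only.

```python
import pandas as pd

# Hypothetical toy listings (not the real car dataset)
cars_demo = pd.DataFrame({
    "transmission": ["Manual", "Manual", "Manual", "Automatic", "Automatic", "Automatic"],
    "selling_price": [200000, 300000, 400000, 350000, 500000, 900000],
})

# Median price within each transmission type
median_by_transmission = cars_demo.groupby("transmission")["selling_price"].median()

# Difference between the two group medians
median_difference = median_by_transmission["Automatic"] - median_by_transmission["Manual"]
print(median_by_transmission)
print(median_difference)
```

Swapping `.median()` for `.mean()` gives the mean difference instead.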

#### Two Quantitative Variables

If we want to know whether older cars tend to cost less money, we are interested in the relationship between `year` and `selling_price`, both of which are quantitative. To answer this question, we can use the Pearson correlation.

For example, if we calculate that the correlation between `year` and `selling_price` is 0.4, we can conclude that there is a moderate positive association between these variables (older cars do tend to cost less money).
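As a minimal sketch, the Pearson correlation between two columns can be obtained with the pandas `.corr()` method, which uses Pearson's method by default. The values in `cars_demo` are invented and follow a much cleaner trend than real listings would.

```python
import pandas as pd

# Hypothetical toy listings in which newer cars cost more
cars_demo = pd.DataFrame({
    "year": [2006, 2008, 2010, 2012, 2014, 2016],
    "selling_price": [130000, 158000, 225000, 300000, 370000, 450000],
})

# Pearson correlation between year and selling price
correlation = cars_demo["year"].corr(cars_demo["selling_price"])
print(correlation)
```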

#### Two Categorical Variables

If we want to know whether automatic transmission cars are more likely to be sold by individuals or dealers, we are interested in the relationship between `transmission` and `seller_type`, both of which are categorical. We can explore this relationship using a contingency table and the Chi-Square statistic.

For example, based on the following contingency table, we might conclude that a higher proportion of cars sold by dealers are automatic (compared to cars sold by individuals):

```
seller_type   Dealer  Individual  Trustmark Dealer
transmission                                      
Automatic        217         212                19
Manual           777        3032                83
```
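The comparison above can be checked with a short sketch that rebuilds the table from its printed counts. With the raw data, a table like this would normally come from `pd.crosstab(cars.transmission, cars.seller_type)`. The variable names here are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the contingency table above
contingency = pd.DataFrame(
    {"Dealer": [217, 777], "Individual": [212, 3032], "Trustmark Dealer": [19, 83]},
    index=["Automatic", "Manual"],
)

# Share of each seller type's listings that are automatic or manual
column_proportions = contingency / contingency.sum(axis=0)
print(column_proportions.round(3))

# Chi-Square test of association between transmission and seller type
chi2, p_value, dof, expected = chi2_contingency(contingency.to_numpy())
print(chi2, p_value, dof)
```

Comparing proportions within each seller type, rather than raw counts, accounts for the fact that individuals list far more cars than dealers do.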

## Conclusion

In this article, we’ve summarized some of the important considerations for choosing a summary statistic based on the question a data analyst wants to answer and the type of data that is available. When it comes to choosing summary statistics, there’s no one right answer, but exploring data holistically and systematically is an important component of EDA.

# Data Summaries
Before diving into formal analysis with a dataset, it is often helpful to perform some initial investigations of the data through exploratory data analysis (EDA) to get a better sense of what you will be working with. Basic summary statistics and visualizations are important components of EDA as they allow us to condense a large amount of information into a small set of numbers or graphics that can be easily interpreted.

This lesson focuses on univariate summaries, where we explore each variable separately. This is useful for answering questions about each individual feature. Variables can typically be classified as quantitative (i.e., numeric) or categorical (i.e., discrete). Depending on its type, we may want to choose different summary metrics and visuals to use.

Let’s say we have the following dataset on New York City rental listings imported into a `pandas` DataFrame (subsetted from the [StreetEasy dataset](https://www.codecademy.com/content-items/d19f2f770877c419fdbfa64ddcc16edc)):
```
import pandas as pd

# Import dataset
rentals = pd.read_csv('streeteasy.csv')

# Preview first 5 rows
print(rentals.head())

# Output
rent	size_sqft	borough
2550	480	        Manhattan
11500	2000	    Manhattan
3000	1000	    Queens
4500	916	        Manhattan
4795	975	        Manhattan
```
As seen, we have two quantitative variables (`rent` and `size_sqft`) and one categorical variable (`borough`). The `pandas` library offers a handy method `.describe()` for displaying some of the most common summary statistics for the columns in a DataFrame. By default, the result only includes numeric columns, but we can specify `include='all'` to the method to display categorical ones as well:
```
# Display summary statistics for all columns
print(rentals.describe(include='all'))

# Output
        rent	size_sqft	borough
count	5000.000000	5000.000000	5000
unique	NaN	NaN	3
top	NaN	NaN	Manhattan
freq	NaN	NaN	3539
mean	4536.920800	920.101400	NaN
std	2929.838953	440.150464	NaN
min	1250.000000	250.000000	NaN
25%	2750.000000	633.000000	NaN
50%	3600.000000	800.000000	NaN
75%	5200.000000	1094.000000	NaN
max	20000.000000	4800.000000	NaN
```

This is a great way to get an overview of all the variables in a dataset. Notice how different statistics are displayed depending on the variable type. In the rest of the lesson, we’ll look more closely at the common ways to summarize and visualize quantitative and categorical variables.

### Exercise
```
import pandas as pd

movies = pd.read_csv('movies.csv')

# Print the first 5 rows 
print(movies.head())

# Print the summary statistics for all columns
print(movies.describe(include = "all"))
```

## Central Tendency for Quantitative Data

For quantitative variables, we often want to describe the central tendency, or the “typical” value of a variable. For example, what is the typical cost of rent in New York City?

There are several common measures of central tendency:

- Mean: The average value of the variable, calculated as the sum of all values divided by the number of values.
- Median: The middle value of the variable when sorted.
- Mode: The most frequent value of the variable.
- Trimmed mean: The mean excluding x percent of the lowest and highest data points.

For our `rentals` DataFrame with a column named `rent` that contains rental prices, we can calculate the central tendency statistics listed above as follows:
```
# Mean
rentals.rent.mean()

# Median
rentals.rent.median()

# Mode
rentals.rent.mode()

# Trimmed mean
from scipy.stats import trim_mean
trim_mean(rentals.rent, proportiontocut=0.1)  # trim extreme 10%
```

### Exercise
```
# Save the mean to mean_budget
mean_budget = movies.production_budget.mean()
print(mean_budget)

# Save the median to med_budget
med_budget = movies.production_budget.median()
print(med_budget)

# Save the mode to mode_budget
mode_budget = movies.production_budget.mode()
print(mode_budget)

# Save the trimmed mean to trmean_budget for 20%
from scipy.stats import trim_mean
trmean_budget = trim_mean(movies.production_budget, proportiontocut = 0.2)
print(trmean_budget)

# Question: How do the mean, median, and mode of movie budgets compare to each other? The median and mode for production_budget are the same at $20M, indicating that $20M is both the middle value and the most frequently occurring value. The mean is quite a bit higher at around $33M, suggesting there may be some outlier movies with extremely high budgets that are pulling the average upward.

# Question: How does trimming the most extreme data points affect the mean budget? The trimmed mean is just under $24M, which is much lower compared to the original mean of $33M and also much closer to the median and mode values. This makes sense because the mean is affected by outliers, so removing the extreme values can bring the mean closer to what would be considered a representative, “typical” budget value.
```

## Spread for Quantitative Data
The spread of a quantitative variable describes the amount of variability. This is important because it provides context for measures of central tendency. For example, if there is a lot of variability in New York City rent prices, we can be less certain that the mean or median price is representative of what the typical rent is.

There are several common measures of spread:

- Range: The difference between the maximum and minimum values of a variable.
- Interquartile range (IQR): The difference between the 75th and 25th percentile values.
- Variance: The average of the squared distance from each data point to the mean.
- Standard deviation (SD): The square root of the variance.
- Mean absolute deviation (MAD): The mean absolute value of the distance between each data point and the mean.

For our `rentals` DataFrame, we can calculate the spread for the `rent` column as follows:
```
# Range
rentals.rent.max() - rentals.rent.min()

# Interquartile range
rentals.rent.quantile(0.75) - rentals.rent.quantile(0.25)

from scipy.stats import iqr
iqr(rentals.rent)  # alternative way

# Variance
rentals.rent.var()

# Standard deviation
rentals.rent.std()

# Mean absolute deviation
# (Series.mad() was removed in pandas 2.0, so compute it directly)
(rentals.rent - rentals.rent.mean()).abs().mean()
```

### Exercise
```
import pandas as pd

movies = pd.read_csv('movies.csv')

# Save the range to range_budget
range_budget = movies.production_budget.max() - movies.production_budget.min()
print(range_budget)

# Save the interquartile range to iqr_budget
from scipy.stats import iqr 
iqr_budget = iqr(movies.production_budget)
print(iqr_budget)

# Save the variance to var_budget
var_budget = movies.production_budget.var()
print(var_budget)

# Save the standard deviation to std_budget
std_budget = movies.production_budget.std()
print(std_budget)

# Save the mean absolute deviation to mad_budget
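# (Series.mad() was removed in pandas 2.0, so compute it directly)
mad_budget = (movies.production_budget - movies.production_budget.mean()).abs().mean()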
print(mad_budget)
```

## Visualizing Quantitative Variables

While summary statistics are certainly helpful for exploring and quantifying a feature, we might find it hard to wrap our minds around a bunch of numbers. This is why data visualization is such a powerful element of EDA.

For quantitative variables, *boxplots* and *histograms* are two common visualizations. These plots are useful because they simultaneously communicate information about minimum and maximum values, central location, and spread. Histograms can additionally illuminate patterns that can impact an analysis (e.g., skew or multimodality).

Python’s `seaborn` library, built on top of `matplotlib`, offers the `boxplot()` and `histplot()` functions to easily plot data from a `pandas` DataFrame:

```
import matplotlib.pyplot as plt 
import seaborn as sns

# Boxplot for rent
sns.boxplot(x='rent', data=rentals)
plt.show()
plt.close()

# Histogram for rent
sns.histplot(x='rent', data=rentals)
plt.show()
plt.close()
```

### Exercise
```
# import codecademylib3  # only available inside the Codecademy learning environment
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

movies = pd.read_csv('movies.csv')

# Create a boxplot for movie budget 
sns.boxplot(x = "production_budget", data = movies)
plt.show()
plt.close()

# Create a histogram for movie budget
sns.histplot(x = "production_budget", data = movies)
plt.show()
plt.close()

# Question: From the plots, what do you notice about the distribution of movie budgets? Both plots show that the distribution of movie budgets is skewed to the right, with some outlier movies having extremely high budgets. This is consistent with the high mean budget value we saw earlier, since the mean is affected by skewness and outliers.
```

## Value Counts for Categorical Data
When it comes to categorical variables, the measures of central tendency and spread that worked for describing numeric variables, like mean and standard deviation, generally become unsuitable when we’re dealing with discrete values. Unlike numbers, categorical values are not continuous and oftentimes do not have an intrinsic ordering.

Instead, a good way to summarize categorical variables is to generate a frequency table containing the count of each distinct value. For example, we may be interested to know how many of the New York City rental listings are from each borough. Relatedly, we can also find which borough has the most listings.

The `pandas` library offers the `.value_counts()` method for generating the counts of all values in a DataFrame column:
```
# Counts of rental listings in each borough
rentals.borough.value_counts()

# Output
Manhattan    3539
Brooklyn     1013
Queens        448
```
By default, it returns the results sorted in descending order by count, where the top element is the mode, or the most frequently appearing value. In this case, the mode is `Manhattan` with 3,539 rental listings.

### Exercise
```
import pandas as pd

movies = pd.read_csv('movies.csv')

# Save the counts to genre_counts
genre_counts = movies.genre.value_counts()
print(genre_counts)
```

## Value Proportions for Categorical Data

A counts table is one approach for exploring categorical variables, but sometimes it is useful to also look at the proportion of values in each category. For example, knowing that there are 3,539 rental listings in Manhattan is hard to interpret without any context about the counts in the other categories. On the other hand, knowing that Manhattan listings make up 71% of all New York City listings tells us a lot more about the relative frequency of this category.

We can calculate the proportion for each category by dividing its count by the total number of values for that variable:
```
# Proportions of rental listings in each borough
rentals.borough.value_counts() / len(rentals.borough)

# Output
Manhattan    0.7078
Brooklyn     0.2026
Queens       0.0896
```

Alternatively, we could also obtain the proportions by specifying `normalize=True` to the `.value_counts()` method:
```
rentals.borough.value_counts(normalize=True)
```

### Exercise
```
import pandas as pd

movies = pd.read_csv('movies.csv')

# Save the proportions to genre_props
genre_props = movies.genre.value_counts(normalize = True)
print(genre_props)
```

## Visualizing Categorical Variables

For categorical variables, bar charts and pie charts are common options for visualizing the count (or proportion) of values in each category. They can also convey the relative frequencies of each category.

Python’s `seaborn` library offers several functions that can create bar charts. The simplest for plotting the counts is `countplot()`:
```
# Bar chart for borough
sns.countplot(x='borough', data=rentals)
plt.show()
plt.close()
```

There are currently no functions in the `seaborn` library for creating a pie chart, but the `pandas` library provides a convenient [wrapper function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.pie.html) around `matplotlib`’s `pie()` function that can generate a pie chart from any column in a DataFrame:

```
# Pie chart for borough
rentals.borough.value_counts().plot.pie()
plt.show()
plt.close()
```

In general, many data analysts avoid pie charts because people are better at visually comparing areas of rectangles than wedges of a pie. For a variable with a small number of categories (i.e., fewer than three), a pie chart is a reasonable choice; however, for more complex data, a bar chart is usually preferable.

### Exercise
```
# import codecademylib3  # only available inside the Codecademy learning environment
import matplotlib.pyplot as plt 
import seaborn as sns
import pandas as pd

movies = pd.read_csv('movies.csv')

# Create a bar chart for movie genre 
sns.countplot(x = "genre", data = movies)
plt.show()
plt.close()

# Create a pie chart for movie genre
movies.genre.value_counts().plot.pie()
plt.show()
plt.close()

# Question: From the plots, what do you notice about the relative frequencies of movie genres? From the plots, we can see that Drama movies appear most frequently, so Drama is the mode for genre. Horror movies appear least frequently in the dataset.
```

## Review

In this lesson, you’ve learned about the common ways to summarize and visualize quantitative and categorical variables for the purpose of EDA.

- We can use `.describe(include='all')` to quickly display common summary statistics for all columns in a `pandas` DataFrame.
- For *quantitative variables*, measures of central tendency (e.g., mean, median, mode) and spread (e.g., range, variance, standard deviation) are good ways to summarize the data. Boxplots and histograms are often used for visualization.
- For *categorical variables*, the relative frequencies of each category can be summarized using a table of counts or proportions. Bar charts and pie charts are often used for visualization.

Being able to use the appropriate metrics and visuals to explore the variables in your dataset can help you to draw insights from your data and prepare for more rigorous analysis and modeling down the road.

### Exercise
```
# Load libraries
import pandas as pd
import numpy as np
# import codecademylib3  # only available inside the Codecademy learning environment
import matplotlib.pyplot as plt
import seaborn as sns

# Import data
students = pd.read_csv('students.csv')

# Print first few rows of data
print(students.head())

# Print summary statistics for all columns
print(students.describe(include = "all"))

# Question: Do more students live in urban or rural locations? 
address_counts = students.address.value_counts()
print(address_counts)
# More students live in urban locations.

# Calculate mean
mean_math = students.math_grade.mean()
print(mean_math)

# Calculate median
median_math = students.math_grade.median()
print(median_math)
# Compare this value to the mean. Is it smaller? larger? The median value is larger than the mean value.

# Calculate mode
mode_math = students.math_grade.mode()
print(mode_math[0])
# What is the most common grade earned by students in this dataset? How different is this number from the mean and median? The most common grade for math is 10. It's not too different from the mean and median.
# Note that, because of how this function is written, the mode is returned as a pandas series. In order to convert it to a single value, we can extract the first value in the series (e.g., students.math_grade.mode()[0])

# Calculate range
range_math = students.math_grade.max() - students.math_grade.min()
print(range_math)

# Calculate standard deviation
std_math = students.math_grade.std()
print(std_math)
# About two thirds of values fall within one standard deviation of the mean. What does this number tell you about how much math grades vary?

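# Calculate MAD (mean absolute deviation)
# Note: Series.mad() was removed in pandas 2.0, so compute it directly
mad_math = (students.math_grade - students.math_grade.mean()).abs().mean()
print(mad_math)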

# Create a histogram of math grades
sns.histplot(x = "math_grade", data = students)
plt.show()
plt.clf()

# Create a box plot of math grades
sns.boxplot(x = "math_grade", data = students)
plt.show()
plt.clf()

# Calculate number of students with mothers in each job category
mothers = students.Mjob.value_counts()
print(mothers)
# Which value of Mjob is most common? The other category is the most frequent job for mothers.

# Calculate proportion of students with mothers in each job category
mother_proportion = mothers / len(students.Mjob)
print(mother_proportion)

# Question: What proportion of students have mothers who work in health? 0.08607594936708861

# Create bar chart of Mjob
sns.countplot(x = "Mjob", data = students)
plt.show()
plt.clf()

# Create pie chart of Mjob
students.Mjob.value_counts().plot.pie()
plt.show()
plt.clf()
```

# Associations: Quantitative and Categorical Variables

Examining the relationship between variables can give us key insight into our data. In this lesson, we will cover ways of assessing the association between a quantitative variable and a categorical variable.

In the next few exercises, we’ll explore a dataset that contains the following information about students at two Portuguese schools:

- `school`: the school each student attends, Gabriel Pereira (`'GP'`) or Mousinho da Silveira (`'MS'`)
- `address`: the location of the student’s home (`'U'` for urban and `'R'` for rural)
- `absences`: the number of times the student was absent during the school year
- `Mjob`: the student’s mother’s job industry
- `Fjob`: the student’s father’s job industry
- `G3`: the student’s score on a math assessment, ranging from 0 to 20

Suppose we want to know: Is a student’s score (`G3`) associated with their school (`school`)? If so, then knowing what school a student attends gives us information about what their score is likely to be. For example, maybe students at one of the schools consistently score higher than students at the other school.

To start answering this question, it is useful to save the scores from each school in two separate variables (each a pandas Series):
```
scores_GP = students.G3[students.school == 'GP']
scores_MS = students.G3[students.school == 'MS']
```

### Exercise
```
import numpy as np
import pandas as pd
import codecademylib3

students = pd.read_csv('students.csv')

#print the first five rows of students:
print(students.head())

#separate out scores for students who live in urban and rural locations:
scores_urban = students.G3[students.address == "U"]
scores_rural = students.G3[students.address == "R"]
```

## Mean and Median Differences

Recall that in the last exercise, we began investigating whether or not there is an association between math scores and the school a student attends. We can begin quantifying this association by using two common summary statistics, mean and median differences. To calculate the difference in mean G3 scores for the two schools, we can start by finding the mean math score for students at each school. We can then find the difference between them:
```
mean_GP = np.mean(scores_GP)
mean_MS = np.mean(scores_MS)
print(mean_GP) #output: 10.49
print(mean_MS) #output: 9.85
print(mean_GP - mean_MS) #Output: 0.64
```

We see that the mean math score for students at GP is 10.49, while the mean score for students at MS is 9.85. The mean difference is 0.64. We can follow a similar process to calculate a median difference:
```
median_GP = np.median(scores_GP)
median_MS = np.median(scores_MS)
print(median_GP) #Output: 11.0
print(median_MS) #Output: 10.0
print(median_GP-median_MS) #Output: 1.0
```

GP students also have a higher median score, by one point. Highly associated variables tend to have a large mean or median difference. Since “large” could have different meanings depending on the variable, we will go into more detail in the next exercise.
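
The same group means and medians can also be computed in one step with a pandas `groupby`. The sketch below uses a small made-up DataFrame (`toy_students`, a hypothetical stand-in for `students.csv`) so that it runs on its own; only the column names mirror the lesson's data.

```python
import pandas as pd

# Hypothetical toy data standing in for students.csv (not the real dataset)
toy_students = pd.DataFrame({
    'school': ['GP', 'GP', 'GP', 'MS', 'MS', 'MS'],
    'G3':     [12, 10, 14, 9, 11, 7],
})

# Mean and median G3 score for each school
summary = toy_students.groupby('school')['G3'].agg(['mean', 'median'])
print(summary)

# Differences (GP minus MS)
mean_diff = summary.loc['GP', 'mean'] - summary.loc['MS', 'mean']
median_diff = summary.loc['GP', 'median'] - summary.loc['MS', 'median']
print(mean_diff, median_diff)
```

This gives the same numbers as filtering each group separately, but it scales more easily when the categorical variable has many categories.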

### Exercise
```
import numpy as np
import pandas as pd
students = pd.read_csv('students.csv')

scores_urban = students.G3[students.address == 'U']
scores_rural = students.G3[students.address == 'R']

#calculate means for each group:
scores_urban_mean = np.mean(scores_urban)
scores_rural_mean = scores_rural.mean()

#print mean scores:
print('Mean score - students w/ urban address:')
print(scores_urban_mean)
print('Mean score - students w/ rural address:')
print(scores_rural_mean)

#calculate mean difference:
mean_diff = scores_urban_mean - scores_rural_mean

#print mean difference
print('Mean difference:')
print(mean_diff)

#calculate medians for each group:
scores_urban_median = np.median(scores_urban)
scores_rural_median = scores_rural.median()

#print median scores
print('Median score - students w/ urban address:')
print(scores_urban_median)
print('Median score - students w/ rural address:')
print(scores_rural_median)

#calculate median difference
median_diff = scores_urban_median - scores_rural_median

#print median difference
print('Median difference:')
print(median_diff)
```

## Side-by-Side Box Plots

The difference in mean math scores for students at GP and MS was 0.64. How do we know whether this difference is considered small or large? To answer this question, we need to know something about the spread of the data.

One way to get a better sense of spread is by looking at a visual representation of the data. Side-by-side box plots are useful in visualizing mean and median differences because they allow us to visually estimate the variation in the data. This can help us determine if mean or median differences are “large” or “small”.

Let’s take a look at side by side boxplots of math scores at each school:
```
sns.boxplot(data = students, x = 'school', y = 'G3')
plt.show()
```

Looking at the plot, we can clearly see that there is a lot of overlap between the boxes (i.e. the middle 50% of the data). Therefore, we can be more confident that there is not much difference between the math scores of the two groups.

In contrast, suppose we saw a plot in which the box for one school sat clearly higher than the box for the other.

In this version, the boxes barely overlap, demonstrating that the middle 50% of scores are different for the two schools. This would be evidence of a stronger association between school and math score.

**Note to Remember**: 
1. If two boxplots are not similar (i.e., they barely overlap), then this would be evidence of a stronger association between the two variables.
2. If two boxplots are similar (i.e., they overlap a lot), then this would be evidence of a weak association between the two variables.

### Exercise
```
import pandas as pd
import codecademylib3
import matplotlib.pyplot as plt 
import seaborn as sns

students = pd.read_csv('students.csv')

#create the boxplot here:
sns.boxplot(data = students, x = 'address', y = "G3")
plt.show()
```

## Inspecting Overlapping Histograms

Another way to explore the relationship between a quantitative and categorical variable in more detail is by inspecting overlapping histograms. In the code below, setting `alpha = .5` ensures that the histograms are see-through enough that we can see both of them at once. We have also used `density=True` to make sure that the y-axis is a density rather than a frequency (note: older versions of matplotlib called this parameter `normed`; it has since been renamed `density`, and `normed` no longer works in current releases):

```
plt.hist(scores_GP , color="blue", label="GP", density=True, alpha=0.5)
plt.hist(scores_MS , color="red", label="MS", density=True, alpha=0.5)
plt.legend()
plt.show()
```

By inspecting this histogram, we can clearly see that the entire distribution of scores at GP (not just the mean or median) appears slightly shifted to the right (higher) compared to the scores at MS. However, there is also still a lot of overlap between the scores, suggesting that the association is relatively weak.

Note that there are only 46 students at MS, but there are 349 students at GP. If we hadn’t used `density=True`, the bars for GP would tower over the bars for MS simply because GP has more students, making it impossible to compare the distributions fairly.

While overlapping histograms and side-by-side box plots can convey similar information, histograms give us more detail and can be useful in spotting patterns that were not visible in a box plot (e.g., a bimodal distribution). For example, the following set of box plots and overlapping histograms illustrates the same hypothetical data.

While the box plots and means/medians appear similar, the overlapping histograms illuminate the differences between these two distributions of scores.

**Note to Remember**:
1. If the overlapping histograms are similar (i.e., there is a lot of overlap), then this is evidence for a relatively weak association.
2. If the overlapping histograms are not similar (i.e., there is not a lot of overlap), then this is evidence for a relatively strong association.
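
To see why a density scale makes groups of different sizes comparable, the following sketch uses `np.histogram` on simulated scores. The group sizes match the lesson (349 and 46), but the values are randomly generated and are not the real data. With `density=True`, the bar areas for each group sum to 1 regardless of how many observations the group contains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (made-up) scores: one large group and one small group
scores_large = rng.normal(loc=10.5, scale=3, size=349)
scores_small = rng.normal(loc=9.9, scale=3, size=46)

# Shared bin edges for both groups (15 bins, each of width 2)
bins = np.linspace(-5, 25, 16)
widths = np.diff(bins)

# Density heights: counts divided by (number of observations * bin width)
density_large, _ = np.histogram(scores_large, bins=bins, density=True)
density_small, _ = np.histogram(scores_small, bins=bins, density=True)

# The total bar area is 1 for each group, whatever its size
area_large = np.sum(density_large * widths)
area_small = np.sum(density_small * widths)
print(area_large, area_small)
```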

### Exercise

```
import numpy as np
import pandas as pd
import codecademylib3
import matplotlib.pyplot as plt 
students = pd.read_csv('students.csv')

scores_urban = students.G3[students.address == 'U']
scores_rural = students.G3[students.address == 'R']

#create the overlapping histograms here:
plt.hist(scores_urban, color = "blue", label = "Urban", density = True, alpha = 0.5)
plt.hist(scores_rural, color = "red", label = "Rural", density = True, alpha = 0.5)
plt.legend()
plt.show()
```

## Exploring Non-Binary Categorical Variables

In each of the previous exercises, we assessed whether there was an association between a quantitative variable (math scores) and a BINARY categorical variable (school). The categorical variable is considered binary because there are only two available options, either MS or GP. However, sometimes we are interested in an association between a quantitative variable and a non-binary categorical variable. Non-binary categorical variables have more than two categories.

When looking at an association between a quantitative variable and a non-binary categorical variable, we must examine all pair-wise differences. For example, suppose we want to know whether or not an association exists between math scores (`G3`) and mother’s job (`Mjob`), a categorical variable with five possible categories: `at_home`, `health`, `services`, `teacher`, or `other`. With five categories, there are 10 different pairwise comparisons that we can make. For example, we can compare scores for students whose mothers work `at_home` or in `health`; `at_home` or `other`; `at_home` or `services`; etc. The easiest way to quickly visualize these comparisons is with side-by-side box plots:

```
sns.boxplot(data = students, x = 'Mjob', y = 'G3')
plt.show()
```

Visually, we need to compare each box to every other box. While most of these boxes overlap with each other, there are some pairs for which there are some apparent differences. For example, scores appear to be higher among students with mothers working in health than among students with mothers working at home or in an “other” job. If there are ANY pairwise differences, we can say that the variables are associated; however, it is more useful to specifically report which groups are different.
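
The count of 10 comparisons comes from choosing 2 categories out of 5. As a rough sketch, the pairs can be enumerated with `itertools.combinations` and a median difference reported for each one. The DataFrame `toy` below is made up for illustration and is not the real `students.csv`.

```python
from itertools import combinations
import pandas as pd

# Hypothetical toy data (not the real students.csv)
toy = pd.DataFrame({
    'Mjob': ['at_home', 'health', 'services', 'teacher', 'other'] * 2,
    'G3':   [8, 14, 11, 12, 9, 10, 16, 11, 12, 9],
})

# Median score within each job category
medians = toy.groupby('Mjob')['G3'].median()

# Every unordered pair of categories: 5 choose 2 = 10 comparisons
pairs = list(combinations(sorted(toy['Mjob'].unique()), 2))
print(len(pairs))

for a, b in pairs:
    print(f'{a} vs {b}: median difference = {medians[a] - medians[b]}')
```

A printed list like this complements the box plots: the plot shows overlap, while the list makes it easy to report which specific groups differ.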

### Exercise
```
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import codecademylib3

students = pd.read_csv('students.csv')

#create the box-plot here:
sns.boxplot(data = students, x = "Fjob", y = "G3")
plt.show()
```

## Review
In this lesson, we used summary statistics and data visualization tools to examine an association between a quantitative and categorical variable. More specifically, we:

- evaluated mean and median differences
- inspected side-by-side box plots
- examined overlapping histograms
- looked at pair-wise comparisons for a quantitative and a non-binary categorical variable

After calculating a mean or median difference and visually comparing distributions, the next step might be to run a hypothesis test to look for evidence of population-level differences (will a similar difference in scores be observed for ALL students who ever attend these schools?). Now that you know how to investigate whether variables are associated, you can use these techniques to explore associations on more datasets.

Note that data in this lesson was downloaded from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/Student+Performance):

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [archive.ics.uci.edu/ml/index.php]. Irvine, CA: University of California, School of Information and Computer Science.

The data was originally collected by:

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

### Exercise
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import codecademylib3

titanic = pd.read_csv('titanic.csv')

print(titanic.head())

#separate out fares by survival
fares_died = titanic.Fare[titanic.Survived == 0]
fares_survived = titanic.Fare[titanic.Survived == 1]

#mean difference
mean_fare_died = np.mean(fares_died)
mean_fare_surv = np.mean(fares_survived)
mean_diff = mean_fare_surv-mean_fare_died
print('mean difference: ')
print(mean_diff)

#median difference
med_fare_died = np.median(fares_died)
med_fare_surv = np.median(fares_survived)
med_diff = med_fare_surv-med_fare_died
print("median difference: ")
print(med_diff)

#create subplots (scroll to see plots)
fig = plt.figure(figsize = (10,20))

#create the boxplot:
ax = fig.add_subplot(2,1,1)
ax = sns.boxplot(data = titanic, x = 'Survived', y = 'Fare')

#create the histograms:
ax = fig.add_subplot(2,1,2)
ax = plt.hist(fares_died, color="blue", label="Died", density=True, alpha=0.5)
ax = plt.hist(fares_survived, color="red", label="Survived", density=True, alpha=0.5)
ax = plt.legend()
plt.show()
```

# Associations: Two Quantitative Variables

When associations exist between variables, it means that **information about the value of one variable gives us information about the value of the other variable**. In this lesson, we will cover ways of examining an association between two quantitative variables.

Throughout the next few exercises, we’ll examine some data about Texas housing rentals on Craigslist — an online classifieds site. The data dictionary is as follows:

- `price`: monthly rental price in USD
- `type`: type of housing (e.g., `'apartment'`, `'house'`, `'condo'`, etc.)
- `sqfeet`: housing area, in square feet
- `beds`: number of beds
- `baths`: number of baths
- `lat`: latitude
- `long`: longitude

Except for `type`, all of these variables are quantitative. Which pairs of variables do you think might be associated? For example, does knowing something about price give you any information about square footage?

## Scatter Plots

One of the best ways to quickly visualize the relationship between quantitative variables is to plot them against each other in a scatter plot. This makes it easy to look for patterns or trends in the data. Let’s start by plotting the area of a rental against its monthly price to see if we can spot any patterns.

```
plt.scatter(x = housing.price, y = housing.sqfeet)
plt.xlabel('Rental Price (USD)')
plt.ylabel('Area (Square Feet)')
plt.show()
```

While there’s a lot of variation in the data, it seems like more expensive housing tends to come with slightly more space. This suggests an association between these two variables.

It’s important to note that different kinds of associations can lead to different patterns in a scatter plot. For example, the following plot shows the relationship between the age of a child in months and their weight in pounds. We can see that older children tend to weigh more but that the growth rate starts leveling off after 36 months:

If we don’t see any patterns in a scatter plot, we can probably guess that the variables are not associated. For example, a scatter plot in which the points form a shapeless cloud with no visible trend would suggest no association.

**Note to Remember**:
1. Scatter plots are used to visually examine an association between two quantitative variables.
2. The pattern depicted in scatter plots can be used to determine whether a linear or non-linear association exists between variables.

### Exercise
```
import pandas as pd
import matplotlib.pyplot as plt 
import codecademylib3

housing = pd.read_csv('housing_sample.csv')

print(housing.head())

#create your scatter plot here:
plt.scatter(x = housing.beds, y = housing.sqfeet)
plt.xlabel('Number of Beds')
plt.ylabel('Area (Square Feet)')
plt.show()
```

## Exploring Covariance

Beyond visualizing relationships, we can also use summary statistics to quantify the strength of certain associations. *Covariance* is a summary statistic that describes the strength of a linear relationship. A linear relationship is one where a straight line would best describe the pattern of points in a scatter plot.

Covariance can range from negative infinity to positive infinity. A positive covariance indicates that a larger value of one variable is associated with a **larger** value of the other. A negative covariance indicates a larger value of one variable is associated with a **smaller** value of the other. A covariance of **0** indicates no linear relationship.

To calculate covariance, we can use the `cov()` function from NumPy, which produces a covariance matrix for two or more variables. A covariance matrix for two variables looks something like this:
```
              variable 1             variable 2
variable 1    variance(variable 1)   covariance
variable 2    covariance             variance(variable 2)
```

In Python, we can calculate this matrix as follows:
```
cov_mat_price_sqfeet = np.cov(housing.price, housing.sqfeet)
print(cov_mat_price_sqfeet)
# Output:
# [[184332.9  57336.2]
#  [ 57336.2 122045.2]]
```

Notice that the covariance appears twice in this matrix and is equal to `57336.2`.
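
Rather than reading the covariance off the printed matrix, it can be pulled out by index. The sketch below uses made-up arrays `x` and `y` (not the housing data) and also checks the result against a "by hand" calculation. By default `np.cov` computes the sample covariance, dividing by n - 1.

```python
import numpy as np

# Made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

cov_mat = np.cov(x, y)
cov_xy = cov_mat[0, 1]  # off-diagonal entry; cov_mat[1, 0] holds the same value

# Sample covariance by hand: sum of products of deviations, divided by n - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

print(cov_mat)
print(cov_xy, cov_manual)
```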

### Exercise
```
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True, precision = 1) 

housing = pd.read_csv('housing_sample.csv')

# calculate and print covariance matrix:
cov_mat_sqfeet_beds = np.cov(housing.sqfeet, housing.beds)
print(cov_mat_sqfeet_beds)

# store the covariance as cov_sqfeet_beds (the off-diagonal entry, about 228.2)
cov_sqfeet_beds = cov_mat_sqfeet_beds[0, 1]
```

## Correlation - Part 1

*Pearson Correlation* (often referred to simply as “correlation”) is a scaled form of covariance. Like covariance, it measures the strength of a linear relationship, but it ranges from -1 to +1, making it more interpretable.

Highly associated variables with a positive linear relationship will have a correlation close to 1. Highly associated variables with a negative linear relationship will have a correlation close to -1. Variables that do not have a linear association (or a linear association with a slope of zero) will have correlations close to 0.

The `pearsonr()` function from `scipy.stats` can be used to calculate correlation as follows:
```
from scipy.stats import pearsonr
corr_price_sqfeet, p = pearsonr(housing.price, housing.sqfeet)
print(corr_price_sqfeet) #output: 0.507
```

Generally, a correlation larger than about .3 (in absolute value) indicates a linear association. A correlation greater than about .6 suggests a strong linear association.
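
To make the "scaled form of covariance" idea concrete, the sketch below divides the covariance by the product of the two sample standard deviations and compares the result with `pearsonr()`. The arrays are made up for illustration and are not the housing data.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Correlation = covariance scaled by both standard deviations
# (ddof=1 so the standard deviations match np.cov's n - 1 denominator)
cov_xy = np.cov(x, y)[0, 1]
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

corr_scipy, p = pearsonr(x, y)
print(corr_manual, corr_scipy)
```

Because the scaling removes the units of both variables, the result always falls between -1 and +1, unlike the covariance itself.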

### Exercise
```
import pandas as pd
import matplotlib.pyplot as plt 
import codecademylib3
from scipy.stats import pearsonr

housing = pd.read_csv('housing_sample.csv')

# calculate corr_sqfeet_beds and print it out:
corr_sqfeet_beds, p = pearsonr(housing.sqfeet, housing.beds)
print(corr_sqfeet_beds)

# create the scatter plot here:
plt.scatter(housing.beds, housing.sqfeet)
plt.xlabel('Number of Beds')
plt.ylabel('Area (Square Feet)')
plt.show()
```

## Correlation - Part 2

It’s important to note that there are some limitations to using correlation or covariance as a way of assessing whether there is an association between two variables. Because correlation and covariance both measure the strength of **linear** relationships with non-zero slopes, but not other kinds of relationships, correlation can be misleading.

For example, the four scatter plots below all show pairs of variables with near-zero correlations. The bottom left image shows an example of a perfect linear association where the slope is zero (the line is horizontal). Meanwhile, the other three plots show non-linear relationships — if we drew a line through any of these sets of points, that line would need to be curved, not straight!

1. Graph 1 looks like an upside-down N
2. Graph 2 looks like a U
3. Graph 3 looks like a horizontal line
4. Graph 4 looks like an upside-down U
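
The U-shaped case can be reproduced with a few lines of made-up data. In this sketch `y` is completely determined by `x`, yet the Pearson correlation comes out at essentially zero because the relationship is not linear.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up data: y depends perfectly on x, but along a U-shaped curve
x = np.linspace(-3, 3, 61)
y = x ** 2

corr_u, p = pearsonr(x, y)
print(corr_u)  # essentially zero, up to floating-point error
```

This is why it helps to look at a scatter plot before relying on a correlation value.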

### Exercise
```
import pandas as pd
import matplotlib.pyplot as plt 
import codecademylib3
from scipy.stats import pearsonr

sleep = pd.read_csv('sleep_performance.csv')

# create your scatter plot here:
plt.scatter(x = sleep.hours_sleep, y = sleep.performance)
plt.xlabel("Hours of Sleep")
plt.ylabel("Performance")
plt.show() 
# Output: Upside down U

# calculate the correlation for `hours_sleep` and `performance`:
corr_sleep_performance, p = pearsonr(sleep.hours_sleep, sleep.performance)
print(corr_sleep_performance)
# Output: 0.2815
```

## Review

In this lesson we discussed several ways of examining an association between two quantitative variables. More specifically, we:

- Used scatter plots to examine relationships between quantitative variables
- Used covariance and correlation to quantify the strength of a linear relationship between two quantitative variables

Note that the dataset used in this lesson was downloaded [from kaggle](https://www.kaggle.com/austinreese/usa-housing-listings).

### Exercise
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import codecademylib3
from scipy.stats import pearsonr
np.set_printoptions(suppress=True, precision = 1) 

penguins = pd.read_csv('penguins.csv')

# Inspect the first few rows of data
print(penguins.head())

# Create a scatter plot of flipper length (flipper_length_mm) and body mass (body_mass_g).
plt.scatter(x = 'flipper_length_mm', y = 'body_mass_g', data = penguins)
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Body Mass (g)')
plt.show()
plt.close()

# Inspect your plot. What is the relationship between these variables?
# Answer: There appears to be a positive linear association: penguins with longer flippers tend to have a greater body mass.

# Calculate the covariance for these two variables.
covariance_matrix = np.cov(penguins.flipper_length_mm, penguins.body_mass_g)
covariance = covariance_matrix[0, 1]  # off-diagonal entry of the matrix
print(covariance) # 9852.2

# Calculate the correlation for these two variables. Does this number make sense given the plot you created?
correlation, p = pearsonr(penguins.flipper_length_mm, penguins.body_mass_g)
print(correlation) # 0.8729788985653615
```

# Associations: Two Categorical Variables

In this lesson, we will cover ways of examining an association between two categorical variables.

As an example, we’ll explore a sample of data from the Narcissistic Personality Inventory (NPI-40), a personality test with 40 questions about personal preferences and self-view. There are two possible responses to each question. The sample we’ll be working with contains responses to the following:

- `influence`: `yes` = I have a natural talent for influencing people; `no` = I am not good at influencing people.
- `blend_in`: `yes` = I prefer to blend in with the crowd; `no` = I like to be the center of attention.
- `special`: `yes` = I think I am a special person; `no` = I am no better or worse than most people.
- `leader`: `yes` = I see myself as a good leader; `no` = I am not sure if I would make a good leader.
- `authority`: `yes` = I like to have authority over other people; `no` = I don’t mind following orders.

As you might guess, responses to some of these questions are associated. For example, if we know whether someone views themself as a good leader, we may also find that they’re more likely to like having authority. In this lesson we’ll learn how to assess whether an association exists between any two of these variables.

## Contingency Tables: Frequencies

Contingency tables, also known as two-way tables or cross-tabulations, are useful for summarizing two variables at the same time. For example, suppose we are interested in understanding whether there is an association between `influence` (whether a person thinks they have a talent for influencing people) and `leader` (whether they see themself as a leader). We can use the `crosstab` function from pandas to create a contingency table. The `crosstab` function outputs a table giving the number of observations in each unique combination of categories for two categorical variables:

```
influence_leader_freq = pd.crosstab(npi.influence, npi.leader)
print(influence_leader_freq)

# Output
# leader       no   yes
# influence
# no         3015  1293
# yes        2360  4429
```

This table tells us the number of people who gave each possible combination of responses to these two questions. For example, 2360 people said that they do not see themselves as a leader but have a talent for influencing people.

To assess whether there is an association between these two variables, we need to ask whether information about one variable gives us information about the other. In this example, we see that among people who **don’t** see themselves as a leader (the first column), a larger number (3015) **don’t** think they have a talent for influencing people. Meanwhile, among people who **do** see themselves as a leader (the second column), a larger number (4429) **do** think they have a talent for influencing people.

So, if we know how someone responded to the leadership question, we have some information about how they are likely to respond to the influence question. This suggests that the variables are associated.
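
One way to make this column-by-column comparison explicit is to divide each column of the frequency table by its total. The sketch below rebuilds the table from the counts printed above instead of loading `npi_sample.csv`. The name `col_props` is introduced here for illustration only.

```python
import pandas as pd

# Observed frequencies copied from the table above
influence_leader_freq = pd.DataFrame(
    [[3015, 1293], [2360, 4429]],
    index=pd.Index(['no', 'yes'], name='influence'),
    columns=pd.Index(['no', 'yes'], name='leader'),
)

# Divide each column by its total: the share of each influence answer
# within each leader answer
col_props = influence_leader_freq / influence_leader_freq.sum(axis=0)
print(col_props.round(3))
```

If the two questions were unrelated, the two columns of `col_props` would look roughly the same. Here they differ noticeably, which is consistent with an association.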

### Exercise
```
import pandas as pd
import codecademylib3

npi = pd.read_csv("npi_sample.csv")

# Do you think there will be an association between special (whether or not a person sees themself as “special”) and authority (whether or not a person likes to have authority)? 

# Create a contingency table for these two variables and store the table as special_authority_freq, then print out the result.
special_authority_freq = pd.crosstab(npi.special, npi.authority)
print(special_authority_freq)
# Output
# authority  no  yes
# special
# no  4069  1905
# yes 2229  2894
```

## Contingency Tables: Proportions

In the previous exercise, we looked at an association between the `influence` and `leader` questions using a contingency table of frequencies. However, sometimes it’s helpful to convert those frequencies to proportions. We can accomplish this simply by dividing all the frequencies in a contingency table by the total number of observations (the sum of the frequencies):

```
influence_leader_freq = pd.crosstab(npi.influence, npi.leader)
influence_leader_prop = influence_leader_freq/len(npi)
print(influence_leader_prop)

# Output
# leader           no       yes
# influence
# no         0.271695  0.116518
# yes        0.212670  0.399117
```

The resulting contingency table makes it slightly easier to compare the proportion of people in each category. For example, we see that the two largest proportions in the table (.399 and .272) are in the yes/yes and no/no cells of the table. We can also see that almost 40% of the surveyed population (by far the largest proportion) both see themselves as leaders and think they have a talent for influencing people.
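
As an alternative to dividing by `len(npi)`, `pd.crosstab` accepts a `normalize` argument that returns proportions directly. The sketch below uses a small made-up DataFrame (`toy_npi`, not the real `npi_sample.csv`) to check that both routes give the same table. They agree here because the toy data has no missing values.

```python
import pandas as pd

# Hypothetical toy responses (not the real npi_sample.csv)
toy_npi = pd.DataFrame({
    'influence': ['yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'yes'],
    'leader':    ['yes', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
})

# Route 1: frequencies divided by the number of rows
freq = pd.crosstab(toy_npi.influence, toy_npi.leader)
prop_by_division = freq / len(toy_npi)

# Route 2: normalize='all' divides every cell by the grand total
prop_by_normalize = pd.crosstab(toy_npi.influence, toy_npi.leader, normalize='all')

print(prop_by_division)
print(prop_by_normalize)
```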

### Exercise

```
import pandas as pd
import numpy as np

npi = pd.read_csv("npi_sample.csv")

special_authority_freq = pd.crosstab(npi.special, npi.authority)

# save the table of proportions as special_authority_prop:
special_authority_prop = special_authority_freq / len(npi)

# print out special_authority_prop
print(special_authority_prop)

# Output
# authority        no       yes
# special
# no  0.366676  0.171668
# yes 0.200865  0.260791
```

## Marginal Proportions

In the previous exercises, we looked at an association between the `influence` and `leader` questions using a contingency table. We saw some evidence of an association between these questions.

Now, let’s take a moment to think about what the tables would look like if there were no association between the variables. Our first instinct may be that there would be .25 (25%) of the data in each of the four cells of the table, but that is not the case. Let’s take another look at our contingency table.

```
leader           no       yes
influence                    
no         0.271695  0.116518
yes        0.212670  0.399117
```

We might notice that the bottom row, which corresponds to people who think they have a talent for influencing people, accounts for 0.213 + 0.399 = 0.612 (or 61.2%) of surveyed people — more than half! This means that we can expect higher proportions in the bottom row, regardless of whether the questions are associated.

The proportion of respondents in each category of a single question is called a *marginal proportion*. For example, the marginal proportion of the population that has a talent for influencing people is 0.612. We can calculate all the marginal proportions from the contingency table of proportions (saved as `influence_leader_prop`) using row and column sums as follows:

```
leader_marginals = influence_leader_prop.sum(axis=0)
print(leader_marginals)
influence_marginals =  influence_leader_prop.sum(axis=1)
print(influence_marginals)

# Output
# leader
# no     0.484365
# yes    0.515635
# dtype: float64
#
# influence
# no     0.388213
# yes    0.611787
# dtype: float64
```

While respondents are approximately split on whether they see themselves as a leader, more people think they have a talent for influencing people than not.

### Exercise
```
import pandas as pd
import numpy as np

npi = pd.read_csv("npi_sample.csv")

# save the table of frequencies as special_authority_freq:
special_authority_freq = pd.crosstab(npi.special, npi.authority)

# save the table of proportions as special_authority_prop:
special_authority_prop = special_authority_freq/len(npi)
print(special_authority_prop)

# calculate and print authority_marginals
authority_marginals = special_authority_prop.sum(axis = 0)
print(authority_marginals)

# calculate and print special_marginals
special_marginals = special_authority_prop.sum(axis = 1)
print(special_marginals)

# Output
# authority        no       yes
# special
# no         0.366676  0.171668
# yes        0.200865  0.260791
#
# authority
# no     0.567541
# yes    0.432459
# dtype: float64
#
# special
# no     0.538344
# yes    0.461656
# dtype: float64
```

## Expected Contingency Tables

In the previous exercise we calculated the following marginal proportions for the `leader` and `influence` questions:
```
leader            influence
no     0.484      no     0.388
yes    0.516      yes    0.612
```

In order to understand whether these questions are associated, we can use the marginal proportions to create a contingency table of *expected proportions* if there were no **association** between these variables. To calculate these expected proportions, we need to multiply the marginal proportions for each combination of categories (the marginals are shown rounded to three decimals; the results are computed from the unrounded values):
```
                  leader = no              leader = yes
influence = no    0.484*0.388 ≈ 0.188      0.516*0.388 ≈ 0.200
influence = yes   0.484*0.612 ≈ 0.296      0.516*0.612 ≈ 0.315
```

These proportions can then be converted to frequencies by multiplying each one by the sample size (11097 for this data):
```
                 leader = no            leader = yes
influence = no   0.188*11097 = 2087     0.200*11097 = 2221
influence = yes  0.296*11097 = 3288     0.315*11097 = 3501
```

This table tells us that **if** there were no association between the `leader` and `influence` questions, we would expect 2087 people to answer `no` to both.
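
The by-hand calculation above can be sketched in NumPy. This is a minimal sketch, not part of the original lesson code: it types in the observed counts for the `leader` and `influence` questions directly (the observed table printed later in this section) instead of reading `npi_sample.csv`, and the variable names are just for illustration.

```python
import numpy as np

# Observed counts typed in by hand (assumption: same values as the lesson's
# observed table; rows are influence no/yes, columns are leader no/yes)
observed = np.array([[3015, 1293],
                     [2360, 4429]])
n = observed.sum()  # sample size

# Marginal proportions: sum across each row / each column, then divide by n
influence_marginals = observed.sum(axis=1) / n
leader_marginals = observed.sum(axis=0) / n

# Expected proportions under no association: product of the marginals
expected_prop = np.outer(influence_marginals, leader_marginals)

# Expected frequencies: expected proportions times the sample size
expected_freq = expected_prop * n

print(np.round(expected_prop, 3))
print(np.round(expected_freq))
```

Because the multiplication uses the unrounded marginals, the rounded frequencies agree with the table above (2087, 2221, 3288, 3501).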

In Python, we can calculate this table using the `chi2_contingency()` function from SciPy by passing in the observed frequency table. The function returns four outputs, but for now we’ll only look at the fourth one, the expected frequencies:
```
from scipy.stats import chi2_contingency
chi2, pval, dof, expected = chi2_contingency(influence_leader_freq)
print(np.round(expected))

# Output
[[2087. 2221.]
 [3288. 3501.]]
```

Note that the SciPy function returned the same expected frequencies as we calculated “by hand” above! Now that we have the expected contingency table if there’s no association, we can compare it to our observed contingency table:
```
leader       no   yes
influence            
no         3015  1293
yes        2360  4429
```

The more the expected and observed tables differ, the more confident we can be that the variables are associated. In this example, we see some pretty big differences (e.g., 3015 in the observed table compared to 2087 in the expected table). This provides additional evidence that these variables are associated.
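
To make the comparison concrete, we can subtract the expected table from the observed one. This is a small sketch under the same assumption as before: the observed counts are typed in by hand, and the expected table is rebuilt from the row and column totals with NumPy rather than taken from `chi2_contingency()`.

```python
import numpy as np

# Observed counts (rows: influence no/yes, columns: leader no/yes)
observed = np.array([[3015, 1293],
                     [2360, 4429]])
n = observed.sum()

# Expected counts under no association: (row total * column total) / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Positive values: more people than expected; negative: fewer than expected
difference = observed - expected
print(np.round(difference))
```

In a 2x2 table all four differences have the same size (here about 928) and only the sign changes, because the row and column totals are the same in both tables.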

### Exercise
```
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

npi = pd.read_csv("npi_sample.csv")

special_authority_freq = pd.crosstab(npi.special, npi.authority)
print("observed contingency table:")
print(special_authority_freq)

# calculate the expected contingency table if there's no association and save it as expected
chi2, pval, dof, expected = chi2_contingency(special_authority_freq)

# print out the expected frequency table
print("expected contingency table (no association):")
print(np.round(expected))

# Output
observed contingency table:
authority    no   yes
special              
no         4069  1905
yes        2229  2894

expected contingency table (no association):
[[3390. 2584.]
 [2908. 2215.]]
```

## The Chi-Square Statistic

In the previous section, we calculated a contingency table of expected frequencies **if** there were no association between the `leader` and `influence` questions. We then compared this to the observed contingency table. Because the tables looked somewhat different, we concluded that responses to these questions are probably associated.

While we can inspect these tables visually, many data scientists use the *Chi-Square statistic* to summarize **how** different these two tables are. To calculate the Chi-Square statistic, we take the squared difference between each value in the observed table and its corresponding value in the expected table, divide it by the value from the expected table, and then add up the results:

$$
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
$$

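The Chi-Square statistic is also the first output of the SciPy function `chi2_contingency()`. For 2x2 tables, SciPy applies Yates' continuity correction by default, so its value is slightly smaller than the one the formula above gives: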
```
from scipy.stats import chi2_contingency
chi2, pval, dof, expected = chi2_contingency(influence_leader_freq)
print(chi2)
# Output
1307.88
```

The interpretation of the Chi-Square statistic depends on the size of the contingency table. For a 2x2 table (like the one we’ve been investigating), a Chi-Square statistic larger than around 4 would strongly suggest an association between the variables (the conventional 5% cutoff for a 2x2 table is about 3.84). In this example, our Chi-Square statistic is much larger than that: 1307.88! This adds to our evidence that the variables are highly associated.
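
The formula can also be checked by hand. The sketch below is an illustration, not the lesson's code: it types in the observed `leader`/`influence` counts and computes the statistic twice with NumPy, once exactly as in the formula and once with Yates' continuity correction, which `chi2_contingency()` applies by default to 2x2 tables (its `correction` parameter). The variable names are made up for this example.

```python
import numpy as np

# Observed counts (rows: influence no/yes, columns: leader no/yes)
observed = np.array([[3015, 1293],
                     [2360, 4429]])
n = observed.sum()

# Expected counts under no association
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Plain formula: sum of (observed - expected)^2 / expected
chi2_plain = float(((observed - expected) ** 2 / expected).sum())

# Yates' correction: shrink each absolute difference by 0.5 before squaring
chi2_yates = float(((np.abs(observed - expected) - 0.5) ** 2 / expected).sum())

print(round(chi2_plain, 2))  # about 1309.29
print(round(chi2_yates, 2))  # about 1307.88, the value SciPy reported above
```

The two values are close, and both are far above the cutoff of around 4, so the conclusion is the same either way.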

**Note to Remember**:
1. The Chi-Square Statistic measures the strength of an association between two categorical variables by comparing an expected contingency table (if there were no association) to an observed contingency table.

### Exercise
```
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

npi = pd.read_csv("npi_sample.csv")

special_authority_freq = pd.crosstab(npi.special, npi.authority)

# calculate the chi squared statistic and save it as chi2, then print it:
chi2, pval, dof, expected = chi2_contingency(special_authority_freq)
print(chi2)
```

## Review

In this lesson we used a few different methods to assess whether there was an association between two categorical variables. Although we used binary variables (only 2 options per category), it is important to note that the same techniques can be used for non-binary categorical variables. The methods we used in this lesson included:

- Contingency tables of frequencies
- Contingency tables of proportions
- Marginal proportions
- Expected contingency tables
- The Chi-Square statistic
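
As a rough illustration that the same steps work for non-binary variables, here is a sketch with a hypothetical helper, `chi_square()`, applied to two made-up 3x2 tables. The counts are invented for the example and are not from the NPI data. No continuity correction is applied.

```python
import numpy as np

def chi_square(observed):
    """Return (chi2, dof, expected) for a table of counts of any size."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    # Expected counts under no association: (row total * column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    # Degrees of freedom: (rows - 1) * (columns - 1)
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return chi2, dof, expected

# Made-up counts: every row has the same 1:2 split, so there is no association
no_association = [[10, 20], [20, 40], [30, 60]]
# Made-up counts: the split changes from row to row
with_association = [[30, 10], [20, 20], [10, 30]]

chi2_a, dof_a, _ = chi_square(no_association)
chi2_b, dof_b, _ = chi_square(with_association)
print(chi2_a, dof_a)  # 0.0 2
print(chi2_b, dof_b)  # 20.0 2
```

When the observed table matches the expected table exactly, the statistic is 0; the more the rows differ from each other, the larger it gets.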

Note that the data in this lesson was downloaded [from Kaggle](https://www.kaggle.com/lucasgreenwell/narcissistic-personality-inventory-responses), then cleaned and subsetted. The data was originally collected and made public by the [Open-Source Psychometrics Project](https://openpsychometrics.org/).






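```
def get_variable_name(variable):
    """Return the first global name bound to the same object as `variable`, or None."""
    # Copy the items so the dict cannot change size while we iterate.
    for name, value in list(globals().items()):
        if value is variable:
            return name
    # A locals() search is left out: inside this function it would always
    # match the parameter name "variable" itself.
    return None

test = [123, 123, "a"]
x = get_variable_name(test)
print(x)

# Output
test
```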
