# Python Fundamentals II: Project

* * * 

### Icons Used In This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Sections
1. [🚀 Project](#project)

<a id='project'></a>


# 🚀 Project

### Data: California Health Interview Survey
The [California Health Interview Survey (CHIS)](https://healthpolicy.ucla.edu/chis/Pages/default.aspx) is the nation's largest state health survey and a critical source of data on Californians as well as on the state's various racial and ethnic groups. The data has been altered for demonstration purposes.

The data has the following columns:

- `general_health`: Self-reported assessment of general health
- `fruit_perweek`: How many pieces of fruit consumed per week
- `veg_perweek`: How many vegetables consumed per week
- `feel_left_out`: How often feeling left out
- `poverty_level`: Poverty level relative to the Federal Poverty Line (FPL)
- `household_tenure`: Self-reported household tenure
- `interview_language`: Language of interview

We will bring together the basic programming, loading data, and statistical analysis/visualization techniques from this workshop to analyze this data. 

First, let's import the packages to use in this analysis:

In [None]:
import numpy as np
import pandas as pd
import os

# Getting the Data

Before we retrieve our data, a few comments about **filepaths**. 

A filepath is the location of a file on your system. There are two kinds of filepaths:

* **absolute**: The filepath from the top level folder of your system.
    * For Mac and Linux, these begin with a **forward slash**, followed by folders separated by forward slashes. E.g. `/Users/[USERNAME]/directory/subdirectory/file`.
    * For Windows, these typically begin with a volume, e.g. `C:\Documents\directory\subdirectory\file`. Note the **backward slash** to separate folders.
* **relative**: The filepath relative to the current working directory (i.e. this notebook's location). 
    * File in same folder: `./file` or `file` (`.` means 'here').
    * Subfolder: `subfolder/file`.
    * Higher folder: `../sisterfolder/file` (`..` means 'go up one level in the directory').
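The difference between the two kinds of filepaths can be checked with the standard `os.path` module. This is a minimal sketch; the path `../sisterfolder/file` is a hypothetical example taken from the list above and does not need to exist on your system.

```python
import os

# Hypothetical relative path, used only for illustration; the file need not exist.
relative_path = os.path.join("..", "sisterfolder", "file")

# abspath resolves a relative path against the current working directory.
absolute_path = os.path.abspath(relative_path)

print(os.getcwd())                   # the current working directory
print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True
print(absolute_path)                 # depends on where this code is run
```

The printed absolute path will differ from machine to machine, which is why relative paths are usually easier to share with others.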

💡 **Tip**: When you are figuring out what filepath to use, you can use `os.listdir([PATH])` to list all files and folders in a path. For example, let's see what is available to us in the current folder (noted with a dot `.`).

In [None]:
import os
os.listdir('.')

To list the items in the folder one level up, use `..` instead:

In [None]:
os.listdir('../')

## 🥊 Challenge 1: Find the Data

Try to locate the files in the "chis_data" folder, which is in the "data" folder, which is in the main "Python-Fundamentals" folder. Using `pd.read_csv()`, read in all three data frames and assign them to the three variables defined below.

💡 **Tip**: You can use Jupyter Lab's File Browser to the left of your screen to get a sense of where the "chis_data" folder is.

💡 **Tip**: As a reminder, here's how we loaded in data in the previous notebook:

```pd.read_csv('../data/gapminder-FiveYearData.csv')```

In [None]:
# YOUR CODE HERE
df_eng = ...
df_esp = ...
df_other = ...

## 🥊 Challenge 2: Concatenate

Look up the [documentation for Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html), and see if you can find a function that **concatenates** the three DataFrames we have now. Save the concatenated DataFrame in a new variable called `df`.

In [None]:
# YOUR CODE HERE


🔔 **Question**: Let's take a look at the final data frame.

1. How many rows and columns are there in the total dataframe?
2. How many numeric columns are there in the dataset?
3. Which columns look interesting to you?

In [None]:
# YOUR CODE HERE


# Exploratory Data Analysis (EDA)

## 🥊 Challenge 3: Data Cleaning 

Often, we will want to remove missing values from a data frame. Have a look at the `general_health` column and find the missing values using the `.isna()` method. Then, use `.sum()` to count the number of missing (NaN) values.
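To see why chaining `.isna()` and `.sum()` produces a count, here is a small sketch on a made-up data frame (the `toy` frame and its `score` column are hypothetical, not part of the CHIS data). `.isna()` returns `True`/`False` for each value, and summing treats each `True` as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values.
toy = pd.DataFrame({"score": [3.0, np.nan, 5.0, np.nan]})

# True where the value is missing, False otherwise.
missing_mask = toy["score"].isna()

# Summing booleans counts the True values.
n_missing = missing_mask.sum()
print(n_missing)  # 2
```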

In [None]:
# YOUR CODE HERE


Remove the rows with missing values in this column using the `.dropna()` method. Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) to see how to do this.

💡 **Tip**: Use the `subset` argument to specify which column(s) to check for missing values.

In [None]:
# YOUR CODE HERE


## 🥊 Challenge 4: Counting Values

Now let's do some Exploratory Data Analysis. One thing we will want to do is count values of interesting features. Run `value_counts()` on the `feel_left_out` column and normalize the output.


In [None]:
# YOUR CODE HERE


## 🥊 Challenge 5: Quantiles

We can use the `quantile()` method to calculate the q-th quantile of the data along a specified axis. Try to find the number of pieces of fruit eaten per week that marks the top 1% of respondents.
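As a quick illustration of how `quantile()` reads its argument, the sketch below uses a hypothetical Series holding the integers 0 through 100 (not the CHIS data). The argument `q` is a fraction between 0 and 1, so `0.5` gives the median.

```python
import pandas as pd

# Hypothetical toy data: the integers 0, 1, ..., 100.
toy = pd.Series(range(101))

median = toy.quantile(0.5)           # half of the values lie at or below this
upper_quartile = toy.quantile(0.75)  # three quarters lie at or below this

print(median)          # 50.0
print(upper_quartile)  # 75.0
```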

In [None]:
# YOUR CODE HERE


## 🥊 Challenge 6: String manipulation

The `household_tenure` column consists of categorical string values. First, let's see what they are. Use the `.unique()` method on the column to check.

In [None]:
# YOUR CODE HERE


If we want to change the string `RENT/SOME OTHER ARRANGEMENT`, we can do so using Pandas `str` methods. Read the [documentation](https://pandas.pydata.org/docs/user_guide/text.html) and find a way to `replace` the string with `RENT`. Make sure to assign the resulting `Series` back to the column in the data frame.
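The pattern of replacing text and assigning the result back can be sketched on made-up data. The `toy` frame, its `size` column, and its labels are hypothetical; the point is that `str.replace` returns a new `Series`, so nothing changes in the data frame until the result is assigned back.

```python
import pandas as pd

# Hypothetical toy data with a label we want to shorten.
toy = pd.DataFrame({"size": ["SMALL/MEDIUM", "LARGE", "SMALL/MEDIUM"]})

# regex=False treats the pattern as plain text rather than a regular expression.
toy["size"] = toy["size"].str.replace("SMALL/MEDIUM", "SMALL", regex=False)

print(toy["size"].unique())
```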

In [None]:
# YOUR CODE HERE


## 🥊 Challenge 7: Cross-tabulate
We can use the `pd.crosstab()` function to cross-tabulate poverty level and health.

In [None]:
pd.crosstab(index=df['poverty_level'], columns=df['general_health'])

Look at the crosstab [documentation](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) and look for arguments that allow you to get normalized values and to include subtotals.

Get the cross-tabulation of `poverty_level` and `general_health`. Normalize the values and include the subtotals.

In [None]:
# YOUR CODE HERE


## 🥊 Challenge 8: Grouping

Use `groupby()` to get the **mean** number of veggies eaten per week (`veg_perweek`) for each category of `general_health`. Sort the means from small to large.
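The split, aggregate, and sort steps can be sketched on a made-up data frame. The `toy` frame and its `group` and `value` columns are hypothetical and stand in for any categorical and numeric column pair.

```python
import pandas as pd

# Hypothetical toy data: two groups with two values each.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [4, 6, 1, 3],
})

# Group the rows, select the numeric column, average within each group,
# then sort the resulting means in ascending order.
means = toy.groupby("group")["value"].mean().sort_values()

print(means)
```

Here group `b` (mean 2.0) is listed before group `a` (mean 5.0) because `sort_values()` sorts in ascending order by default.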

In [None]:
# YOUR CODE HERE


## 🥊 Challenge 9: Visualizing Correlations

Let's try to find out if there's a correlation between `fruit_perweek` and `veg_perweek`.

The correlation between features can be computed using `.corr()`.
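As a rough sketch of what `.corr()` returns, the made-up example below uses two hypothetical columns, `x` and `y`, where `y` is exactly twice `x`. Called on a data frame, `.corr()` gives a matrix of pairwise correlations; called on one `Series` with another as the argument, it gives a single number.

```python
import pandas as pd

# Hypothetical toy data: y is a perfect linear function of x.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

# Pairwise (Pearson) correlation matrix for all numeric columns.
corr_matrix = toy.corr()

# Correlation between two specific columns.
r = toy["x"].corr(toy["y"])

print(corr_matrix)
print(r)  # approximately 1.0, since the relationship is perfectly linear
```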

In [None]:
# YOUR CODE HERE


Use Pandas' `.plot()` method to visualize the relationship between the two columns with a scatterplot.

💡 **Tip**: Use the argument `kind='scatter'` to get a scatterplot.

In [None]:
# YOUR CODE HERE


## Writing Files

Finally, a `pd.DataFrame` can be exported to a `.csv` file (or another file type) using `df.to_csv()`, a method available on every data frame.

🔔 **Question**:  Where does `chis_total.csv` get saved? What if you wanted to save it to the "data" directory?

In [None]:
df.to_csv('chis_total.csv') 
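One detail worth knowing: by default, `to_csv()` also writes the data frame's index as an extra first column. The sketch below uses a hypothetical `toy` frame and a temporary folder (so nothing is left behind on disk) to show how `index=False` keeps the saved columns identical to the original ones.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Write to a temporary folder that is deleted when the block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(out_path, index=False)  # index=False leaves out the index column
    round_trip = pd.read_csv(out_path)

print(round_trip.columns.tolist())  # ['a', 'b']
```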

# 🎉 Well done!

Today's project took us through importing multiple `.csv` files, cleaning and analyzing the data, and creating a basic visualization.

### 💡 Tip: More workshops!

D-Lab teaches workshops that allow you to practice more with DataFrames and visualization.

- To learn more about data wrangling, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).
- To learn more about data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).