# 🚀 Project

* * * 

### Icons Used In This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💭 **Reflection**: Helping you think about programming.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [🚀 Project](#project)

# 🚀 Project

### Data: California Health Interview Survey
In this section, we will go through an example project. The California Health Interview Survey (CHIS) is the nation's largest state health survey and a critical source of data on Californians as well as on the state's various racial and ethnic groups.

We will bring together the basic programming, loading data, and statistical analysis/visualization techniques from this workshop to analyze this data. 

First, let's import the packages to use in this analysis:

In [None]:
import numpy as np
import pandas as pd
import os

## 1. Getting the data

Before we can get our data, you should know something more about **filepaths**. 

A filepath is the location of a file on your system. There are two kinds of filepaths:

* **absolute**: The filepath from the top-level directory (or folder).
    * On macOS and Linux, these begin with a forward slash, and folders are separated by a **forward slash**. E.g. `/Users/[USERNAME]/directory/subdirectory/file`.
    * On Windows, these begin with a backslash or, more commonly, a volume (drive letter), e.g. `C:\Documents\directory\subdirectory\file`. Note the **backslash** used to separate folders.
* **relative**: The filepath relative to the current working directory (i.e. notebook location). Common locations include:
    * File in same folder: `./file` or `file` (`.` means 'here').
    * Subfolder: `subfolder/file`.
    * Higher folder: `../sisterfolder/file` (`..` means 'go up one level in the directory').
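
The difference between the two kinds of path can be sketched with the standard-library `os.path` module. This is a minimal illustration; the `sisterfolder/file` path is hypothetical and does not need to exist for the calls below to work.

```python
import os

# A relative path: it only makes sense relative to the current working directory.
relative = os.path.join('..', 'sisterfolder', 'file')  # hypothetical path

# abspath resolves a relative path against the current working directory.
absolute = os.path.abspath(relative)

print(os.path.isabs(relative))  # is the first path absolute?
print(os.path.isabs(absolute))  # is the resolved path absolute?
print(absolute)
```

`os.path.join` inserts the separator that is correct for your operating system, so the same code works on macOS, Linux, and Windows.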

When you are figuring out which filepath to use, you can call `os.listdir([PATH])` to list everything (both files and folders) in a path. For example, let's see what is available to us in the current folder (noted with a dot `.`).

🔔 **Question**: In this current folder we're checking out, which items are folders and which are files? (**Hint:** You can double check by looking at the files in JupyterLab/ Jupyter Notebook).

In [None]:
import os
os.listdir('.')

Looking up the items in the folder after moving up one level works like this:

In [None]:
os.listdir('../')
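
`os.listdir()` returns names only, without saying which are folders. As a rough sketch, `os.path.isdir()` can tell them apart. The example builds a throwaway folder with made-up contents so that it runs anywhere.

```python
import os
import tempfile

# Build a throwaway folder containing one subfolder and one file (made-up names).
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'subfolder'))
    with open(os.path.join(tmp, 'notes.txt'), 'w') as f:
        f.write('hello')

    # os.listdir returns bare names, so join each with its parent before testing it.
    kinds = {}
    for name in sorted(os.listdir(tmp)):
        full_path = os.path.join(tmp, name)
        kinds[name] = 'folder' if os.path.isdir(full_path) else 'file'

print(kinds)
```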

### 1.1 Find the Data

Use `os.listdir()` to see the files in the "chis_data" folder, which is in the "data" folder, which is in the main "Python-Fundamentals" folder.

💡 **Tip**: Remember how to move up in the folder structure? `../../` goes up two folders!

💡 **Tip**: You can use Jupyter Lab's File Browser to the left of your screen to get a sense of where the "chis_data" folder is.

In [None]:
# YOUR CODE HERE


### 1.2 Load in a single file

We have 3 csv files based on the language in which the Health Interview was held.
Let's load in one of these CSV files.

1. Read in the `chis_esp.csv` file as a `pandas` DataFrame.
2. How many rows are there? How many columns?

In [None]:
import pandas as pd

# Load in file
chis_esp = pd.read_csv('../../data/chis_data/chis_esp.csv')
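
To count rows and columns, one option is the `.shape` attribute of a DataFrame. The sketch below uses a tiny in-memory CSV with invented columns rather than the CHIS file, so the numbers are not the answer to the question above.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file (invented columns and values).
csv_text = "age,language\n34,esp\n51,esp\n27,esp\n"
toy = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```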

### 1.3 Load in Multiple Files

It looks like the CSV files we have are split by language. We want to combine these files into one big DataFrame using a loop.
However, notice that there is a `.txt` file in the directory, which isn't a CSV file. Reading it as if it were one would likely raise an error or produce a meaningless DataFrame, so let's use an `if` statement to keep only the files with the `.csv` extension.

Slice the last 3 characters of the `test_csv` variable and use the equality operator (`==`) to compare them with `'csv'`. The expression should return `True`.

💡 **Tip**: Recall slicing the last elements of a list. For instance, use `some_list[-2:]` to get the last two items.
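
Strings slice the same way as lists. A small sketch on a made-up file name (not the one from the exercise):

```python
letters = ['a', 'b', 'c', 'd']
last_two = letters[-2:]           # the last two items of a list

filename = 'notes.txt'            # hypothetical file name
last_three = filename[-3:]        # the last three characters of a string

print(last_two)
print(last_three)
print(last_three == 'txt')        # comparison with == gives a boolean
print(filename.endswith('.txt'))  # built-in alternative that also checks the dot
```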

In [None]:
test_csv = 'chis_esp.csv' # Expression should evaluate True

# YOUR CODE HERE


Now that we have an expression, let's create a for-loop to check if it works over the files in our folder. 

In [None]:
directory = '../../data/chis_data'
for file in os.listdir(directory):
    if file[-3:] == 'csv': # Keep only files whose names end with `csv`
        print(file)

## 🥊 Challenge: Putting it all together

We've got most of the pieces. Now let's put the puzzle together:
1) Initialize an *accumulator* list called `df_list`.
2) Reuse the `for`-loop we just created to loop over the csv files in the right folder. But in the final line, instead of `print`ing the file, read it as a DataFrame and `append` the output to our `df_list` list.

⚠️ **Warning**: When calling `read_csv()`, you will need to pass the **full filepath** (folder plus file name), or the file will not be found!
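
The accumulator pattern and the full-filepath requirement can be sketched together on throwaway files. Everything here (file names, contents, the `frames` variable) is invented for illustration; adapt the idea to the CHIS folder yourself.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as directory:
    # Create two small CSV files and one text file to skip (all made up).
    samples = [('a.csv', 'x\n1\n2\n'), ('b.csv', 'x\n3\n'), ('readme.txt', 'notes')]
    for name, text in samples:
        with open(os.path.join(directory, name), 'w') as f:
            f.write(text)

    frames = []  # accumulator list, filled one DataFrame at a time
    for file in sorted(os.listdir(directory)):
        if file[-3:] == 'csv':
            # os.listdir gives only the file name, so prepend the folder.
            full_path = os.path.join(directory, file)
            frames.append(pd.read_csv(full_path))

print(len(frames))
print([len(frame) for frame in frames])
```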

In [None]:
# YOUR CODE HERE    


Finally, look up the [documentation for Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html), and see if you can find a function that **concatenates** the list of DataFrames we have now. We'll save the concatenated DataFrame in a variable called `df`.
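
If you get stuck, here is a hedged sketch of concatenation on two toy DataFrames, assuming `pd.concat` is the function you are after. The `ignore_index=True` argument is optional; it renumbers the rows instead of repeating each frame's own index.

```python
import pandas as pd

first = pd.DataFrame({'x': [1, 2]})
second = pd.DataFrame({'x': [3]})

# Stack the frames on top of each other and renumber the rows 0..n-1.
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```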

In [None]:
# YOUR CODE HERE


🔔 **Question**: Let's take a look at the final data frame.

1. How many rows and columns are there in the total dataframe?
2. How many numeric columns are there in the dataset?
3. Which columns look interesting to you?
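
One possible way to approach the first two questions is `.shape` together with `select_dtypes()`. The toy DataFrame below has invented values; only the column name `veg_perweek` is borrowed from this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [34, 51, 27],
    'veg_perweek': [7.0, 3.5, 10.0],
    'language': ['esp', 'eng', 'esp'],
})

print(toy.shape)  # (rows, columns)

# Keep only the columns with a numeric dtype (integers and floats).
numeric = toy.select_dtypes(include='number')
print(list(numeric.columns))
```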

In [None]:
# YOUR CODE HERE


## 2. Data Processing

### 2.1 Exploratory Data Analysis (EDA)

Now let's do some Exploratory Data Analysis. One thing we will want to do is count values of interesting features. Run `value_counts()` on the `feel_left_out` column and normalize the output.
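
As a small illustration of normalizing, `value_counts(normalize=True)` returns shares instead of raw counts. The answers below are made up and are not the real categories of `feel_left_out`.

```python
import pandas as pd

answers = pd.Series(['often', 'never', 'never', 'sometimes'])  # made-up answers

# With normalize=True the counts are divided by the total, so they sum to 1.
shares = answers.value_counts(normalize=True)
print(shares)
```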


In [None]:
# YOUR CODE HERE



One thing we will want to do is look at potential correlations between features that we think might be interesting to pursue further. 

Pick two of them, then 



### 2.2 Quantiles

We can use the `quantile()` method to calculate the q-th quantile of the data along a specified axis. Try to find the amount of fruits eaten by the top 1% of respondents.
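
A rough sketch of how `quantile()` behaves, using the integers 0 through 100 as stand-in data. The argument is a fraction between 0 and 1, and by default pandas interpolates linearly between data points.

```python
import pandas as pd

values = pd.Series(range(101))  # the integers 0 through 100

median = values.quantile(0.5)      # half of the values lie at or below this
upper_cut = values.quantile(0.99)  # 99% of the values lie at or below this
print(median, upper_cut)
```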

In [None]:
# YOUR CODE HERE


### 2.3 Cross-tabulate
We can use the `pd.crosstab()` function to cross-tabulate poverty level and health. 

In [None]:
pd.crosstab(index=df['poverty_level'], columns=df['general_health'])

Read the crosstab [documentation](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) and look for the arguments that let you normalize the values and include subtotals.

Get the cross tab of `poverty_level` and `general_health`. Normalize them and print the subtotals. 
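
For reference, a sketch on toy data of what normalizing and subtotals look like, assuming the `normalize` and `margins` arguments are the ones you found. The column names and values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'answer': ['yes', 'no', 'yes', 'yes'],
})

# normalize='all' divides every cell by the grand total;
# margins=True adds an 'All' row and column holding the subtotals.
table = pd.crosstab(index=toy['group'], columns=toy['answer'],
                    normalize='all', margins=True)
print(table)
```

Other values of `normalize` (`'index'` or `'columns'`) divide by row or column totals instead.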

In [None]:
# YOUR CODE HERE


### 2.4 Grouping

Use `groupby()` to get the **means** of the amount of veggies eaten per week (`veg_perweek`) when grouping by `general_health`. Sort the means from small to large.
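
The group-then-aggregate-then-sort chain can be sketched on a toy DataFrame. The column names `health` and `servings` and their values are invented stand-ins.

```python
import pandas as pd

toy = pd.DataFrame({
    'health': ['good', 'poor', 'good', 'poor'],
    'servings': [10, 2, 6, 4],
})

# Group rows by category, average one column per group, then sort ascending.
means = toy.groupby('health')['servings'].mean().sort_values()
print(means)
```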

In [None]:
# YOUR CODE HERE


### 2.5 Dummy variables
We can create dummy variables using the `pd.get_dummies()` function.
Use `get_dummies()` to dummify the `general_health` column.
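
A minimal sketch of what dummy coding produces, on an invented Series: each category becomes its own column, and every row has exactly one "on" value. Depending on your pandas version the values may display as `True`/`False` or as `1`/`0`.

```python
import pandas as pd

health = pd.Series(['good', 'poor', 'good'], name='health')  # made-up values

# One column per category; a row is "on" only in the column of its own category.
dummies = pd.get_dummies(health)
print(dummies)
```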

In [None]:
# YOUR CODE HERE


### 2.6 Visualizing Correlations

Let's try to find out if there's a correspondence between **poverty level** and **poor general health**.

Getting the correlations between features is easily done using `.corr()`:

In [None]:
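# `pov_health` is not defined above: build it first as a DataFrame of numeric
# (for example dummy-coded) poverty and health columns.
pov_health.corr()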
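
As a rough illustration of what `.corr()` returns, here is a toy DataFrame with two invented 0/1 columns. The result is a square table of pairwise Pearson correlations, with 1.0 on the diagonal.

```python
import pandas as pd

# Invented indicator columns; these are not CHIS values.
toy = pd.DataFrame({
    'low_income': [1, 1, 0, 0, 1],
    'poor_health': [1, 0, 0, 0, 1],
})

corr = toy.corr()  # pairwise Pearson correlation between numeric columns
print(corr)
```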

Use [Matplotlib](https://matplotlib.org/) or [Seaborn](https://seaborn.pydata.org/) to visualize the correlation using a bar plot. Click the links to read their documentation if you need a refresher!

In [None]:
# YOUR CODE HERE


## 3. Writing Files

Finally, a `pd.DataFrame` can be exported to a `.csv` (or other file type) using `df.to_csv()`. This method is built into every DataFrame.

🔔 **Question**:  Where does `chis_total.csv` get saved? What if you wanted to save it to the "data" directory?

In [None]:
df.to_csv('chis_total.csv') 
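
By default `to_csv()` also writes the row index as an extra first column. A small sketch of a round trip on toy data, writing to a temporary folder (the folder and file name are made up) with `index=False` to leave the index out:

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})

with tempfile.TemporaryDirectory() as out_dir:
    # Build the output path from a folder and a file name.
    out_path = os.path.join(out_dir, 'toy.csv')
    toy.to_csv(out_path, index=False)  # index=False skips the row index
    round_trip = pd.read_csv(out_path)

print(list(round_trip.columns))
```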

# 🎉 Well done!

**This concludes Python Fundamentals II!**

Today's project took us through importing multiple csv files, data manipulation, and some basic visualizations and analysis of data. 

If you were working on this dataset, what would you potentially do next? It could be either an analysis, a new feature to include, a visualization that might help represent the data, etc.

### 💡 Tip: More workshops!

D-Lab teaches workshops that allow you to practice more with DataFrames and visualization.

- To learn more about data wrangling, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).
- To learn more about data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).