# Creating a New Subset from a Dataset: Covid-19 in Colleges

## Overview
These instructions show how to use Python 3 to create a new subset of data from a publicly available dataset.
* They are written for readers with little or no experience with Python 3 or with coding on data, but anyone can follow them.

The process has five steps:
1. Open **Google Colab**
2. Create your own **New Notebook**
3. Take the necessary steps to **import your `.csv` file** into a **dataframe**
4. **Filter** the raw data to create a **subset** of that dataframe
5. **Export** the **new subset** as another `.csv`

## Getting Started
To begin, create a folder in your Google Drive to store all of the files needed for this process. The example code in this guide uses a folder named `Repository Project`.

* Search for **Google Colab** in your browser, open it, and create a **new notebook** using the option in the bottom left.

* Download the `.csv` file named `colleges.csv` and move it into your new folder. It can also be downloaded from [this link](https://drive.google.com/file/d/1pEJc0ObWasUXlM9VdPPXm3xrMc-6LzZW/view?usp=drive_link).
* Be sure to keep this **`.ipynb`** file in the same folder as well.

**Packages** that you will need, such as **Pandas** and **NumPy**, are already included in Google Colab along with Python itself. You just need to **import** both **Pandas** and **NumPy**.

> To make things easier, import pandas **as pd** and import numpy **as np**.

These short aliases keep your code tidy and make pandas functions quicker to type. Follow the example below:


In [3]:
import numpy as np
import pandas as pd

Next, **mount** your **Google Drive** so you can access your saved data files.

> Mount your Google Drive by running the following code:

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Read the file with Pandas and then display the data to make sure it is working properly.
> The file can be read with the `pd.read_csv()` function by placing the file path inside the parentheses. The path must include the Google Drive mount location and the folder where you saved the file. The path in the example is relative, so it assumes the notebook is running from `/content`, the folder where the drive was mounted.

> You can assign the data any name you would like on the left side of the `=`, but I recommend naming it `df` to keep it simple.

> Read and name the file by following the example below:

In [4]:
df = pd.read_csv('gdrive/MyDrive/Repository Project/colleges.csv')

To display the data, pass the name of your dataframe to `display()` as seen below:


In [5]:
display(df)

|  | date | state | county | city | ipeds_id | college | cases | cases_2021 | notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5/26/2021 | Alabama | Madison | Huntsville | 100654 | Alabama A&M University | 41 |  |  |
| 1 | 5/26/2021 | Alabama | Montgomery | Montgomery | 100724 | Alabama State University | 2 |  |  |
| 2 | 5/26/2021 | Alabama | Limestone | Athens | 100812 | Athens State University | 45 | 10.0 |  |
| 3 | 5/26/2021 | Alabama | Lee | Auburn | 100858 | Auburn University | 2742 | 567.0 |  |
| 4 | 5/26/2021 | Alabama | Montgomery | Montgomery | 100830 | Auburn University at Montgomery | 220 | 80.0 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1943 | 5/26/2021 | Wisconsin | Milwaukee | Milwaukee | 240338 | Wisconsin Lutheran College | 143 | 23.0 |  |
| 1944 | 5/26/2021 | Wyoming | Natrona | Casper | 240505 | Casper College | 376 | 46.0 |  |
| 1945 | 5/26/2021 | Wyoming | Goshen | Torrington | 240596 | Eastern Wyoming College | 17 | 5.0 |  |
| 1946 | 5/26/2021 | Wyoming | Albany | Laramie | 240727 | University of Wyoming | 2087 | 292.0 |  |
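
Beyond looking at the display, it can help to check the size of the dataframe and count the blank cells. The sketch below is a minimal illustration under one assumption: it builds a small stand-in dataframe called `sample` (a hypothetical name) from the first five rows shown above, so it runs without the real file. In your notebook you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Stand-in data: the first five rows from the display above.
# `sample` is a hypothetical name; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "state": ["Alabama"] * 5,
    "college": [
        "Alabama A&M University",
        "Alabama State University",
        "Athens State University",
        "Auburn University",
        "Auburn University at Montgomery",
    ],
    "cases": [41, 2, 45, 2742, 220],
    "cases_2021": [np.nan, np.nan, 10.0, 567.0, 80.0],
})

# .shape is a (rows, columns) pair.
n_rows, n_cols = sample.shape

# .isna() marks blank cells; .sum() counts them.
missing = sample["cases_2021"].isna().sum()

print(sample.head(3))  # first three rows only
print("rows:", n_rows, "columns:", n_cols)
print("missing cases_2021 values:", missing)
```

Blank cells in `cases_2021` are stored as `NaN`, which is also why the filled values in that column appear with a decimal point, such as `10.0`.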


**Good job!** You should now be able to see your raw dataset in Google Colab.

If you have trouble seeing your data, double-check that the file name inside the parentheses matches the name of the file saved in your Google Drive folder.
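
One way to check the file name is to list the `.csv` files that Python can actually see in your folder. The helper below, `list_csv_files`, is a hypothetical function written for this guide and is not part of pandas or Colab. The folder path assumes the mount point and folder name used in the examples above, so adjust it if yours differ.

```python
from pathlib import Path


def list_csv_files(folder):
    """Return the sorted names of the .csv files directly inside `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        # The folder does not exist or the drive is not mounted.
        return []
    return sorted(p.name for p in folder.glob("*.csv"))


# Assumed location: the Drive mount point plus the folder name used in this guide.
project_folder = "/content/gdrive/MyDrive/Repository Project"
print(list_csv_files(project_folder))
```

An empty list suggests that the drive is not mounted, the folder name is different, or the file was saved somewhere else.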

## Creating the Subset

Now that you have the raw data, you can work on creating your subset.


1. Create an index from `'state'` in our dataframe so we can isolate a specific state for our subset.

> The inner part of this code, inside the parentheses, should contain `'state'`

> The outer part of this code should be `df.set_index()`

2. Assign the result a simple name.

In [7]:
df_indexed = df.set_index('state')

3. Create a filtering command that isolates every row with `North Carolina` as the `'state'` using `.loc`.

> `.loc` lets us keep only the rows for **North Carolina** and only the columns we want.

> To do so, put **"North Carolina"** on the left side of the comma in the brackets to indicate that we only want rows labeled `North Carolina`.

> To choose the columns for our subset, put **"college","cases","cases_2021"** inside their own set of brackets on the right side of the comma.

> The inner part of this code should contain `"North Carolina", ["college","cases","cases_2021"]`

> The outer part of this code should be `df_indexed.loc[]`

As before, assign the subset a simple, easy-to-remember name.

In [8]:
NC_Subset = df_indexed.loc["North Carolina", ["college","cases","cases_2021"]].copy()

> Include `.copy()` at the end of the code so that later changes to the subset do not trigger a `SettingWithCopyWarning`.

4. Display the subset and make sure it is working as intended.

In [9]:
display(NC_Subset)

| state | college | cases | cases_2021 |
|---|---|---|---|
| North Carolina | Appalachian State University | 1954 | 612.0 |
| North Carolina | Barton College | 198 | 90.0 |
| North Carolina | Belmont Abbey College | 66 |  |
| North Carolina | Brevard College | 126 | 106.0 |
| North Carolina | Campbell University | 397 | 248.0 |
| North Carolina | Carolinas College of Health Sciences | 25 | 7.0 |
| North Carolina | Catawba College | 184 | 139.0 |
| North Carolina | Chowan University | 46 | 21.0 |
| North Carolina | Davidson College | 217 | 169.0 |
| North Carolina | Duke University | 1202 | 935.0 |

**CONGRATULATIONS!** You now have your working subset of the data!

Now the final step is to export this newly formed subset.

## Exporting Subset

Exporting our subset of data involves using the `.to_csv()` method.

1. The name of the file goes inside the parentheses, followed by `index=False` so that the index is not written to the file. Here the index is the `state` label, which is `North Carolina` on every row.

> Once again make sure to name the file something simple and easy to remember.

In [10]:
NC_Subset.to_csv("NC_Subset.csv", index=False)

2. The file for the subset should now appear in the file browser of your Colab session.

> If it does not appear as a `.csv` file, double-check the name you gave the file and confirm that you included the `.csv` extension at the end.
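
To confirm that an export worked, you can read the file back in and compare it with what you wrote. The sketch below assumes a stand-in dataframe called `subset` (a hypothetical name) built from three rows of the display above, and it writes to a temporary folder so it leaves no files behind. In your notebook you would read back `"NC_Subset.csv"` directly.

```python
import os
import tempfile

import pandas as pd

# Stand-in subset using three rows from the display above.
# `subset` is a hypothetical name; in the notebook, use `NC_Subset` instead.
subset = pd.DataFrame(
    {
        "college": [
            "Appalachian State University",
            "Barton College",
            "Brevard College",
        ],
        "cases": [1954, 198, 126],
        "cases_2021": [612.0, 90.0, 106.0],
    },
    index=pd.Index(["North Carolina"] * 3, name="state"),
)

# Write to a temporary folder, then read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "NC_Subset.csv")
    subset.to_csv(out_path, index=False)  # index=False leaves out the state index
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print("row counts match:", len(reloaded) == len(subset))
```

Because of `index=False`, the reloaded file has no `state` column. If you want to keep the state in the exported file, leave out `index=False`.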

**CONGRATULATIONS!!** You have now fully created and exported a subset of the data!