# Working with data downloads

Much of the data you work with as a data scientist will come from a **remote source**, such as a website. File downloads sometimes come in analysis-ready formats like CSV. At other times, the data will arrive in an archive format like TAR or ZIP. Archive formats bundle multiple data files into a single archive file. ZIP also compresses the files it contains, and TAR archives are commonly compressed with a tool like gzip, which reduces overall size and makes downloads faster.
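To make the bundling and compression ideas concrete, here is a minimal sketch using Python's standard `zipfile` module. The file names and contents are made up for illustration and are not part of the data set used in this lesson.

```python
import io
import zipfile

# Two made-up CSV files with repetitive content (hypothetical names and data).
files = {
    "schools.csv": "school_id,state\n" + "1,NY\n" * 1000,
    "enrollment.csv": "school_id,total\n" + "1,500\n" * 1000,
}

# Write both files into one in-memory ZIP archive using DEFLATE compression.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, text in files.items():
        archive.writestr(name, text)

# Compare the combined size of the raw files with the size of the archive.
raw_size = sum(len(text.encode("utf-8")) for text in files.values())
zip_size = len(buffer.getvalue())

# Reopen the archive to list the files it bundles.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    bundled_names = archive.namelist()

print(bundled_names)
print(f"raw: {raw_size} bytes, zipped: {zip_size} bytes")
```

Repetitive text like this compresses very well, so the archive ends up much smaller than the two files combined. Real data files usually compress less dramatically.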

## Data set

The [data set](https://catalog.data.gov/dataset/civil-rights-data-collection-2013-14) we'll work with is called the **Civil Rights Data Collection**. It contains information on **educational achievement and opportunities in the U.S.**, broken down **by race and school**. For example, it records the racial composition of the students enrolled in advanced classes at each school. Each row represents a school, while each column records an indicator of academic achievement.

### Unzipping
Before we can load and analyze the data, we'll need to **extract the files that contain it from the archive file**, `crdc201314csv.zip`. Running the `unzip` command on an archive file extracts the files within it.

In [2]:
# unzip crdc201314csv.zip

This command will **extract all of the files from the archive** into the current directory. Once we've extracted the files inside an archive file, it's good practice to delete the original archive to save space.
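If the `unzip` command isn't available, the same extract-then-delete workflow can be sketched in Python with the standard `zipfile` module. This example builds a small stand-in archive (`demo.zip`, a hypothetical name) inside a temporary directory so it can run anywhere. With the real data you would point `archive_path` at `crdc201314csv.zip` instead.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    archive_path = workdir / "demo.zip"  # hypothetical stand-in for crdc201314csv.zip

    # Build a tiny archive so the example is self-contained.
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("demo/data.csv", "a,b\n1,2\n")

    # Extract everything into the working directory, like `unzip demo.zip`.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(workdir)

    # Delete the archive once its contents have been extracted.
    archive_path.unlink()

    # Record what is left on disk before the temporary directory is cleaned up.
    extracted = sorted(
        p.relative_to(workdir).as_posix() for p in workdir.rglob("*") if p.is_file()
    )
    archive_still_exists = archive_path.exists()

print(extracted)
print(archive_still_exists)
```

Deleting the archive only after `extractall` has returned without raising an error avoids losing the data if extraction fails partway through.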

Activate the virtualenv by running:

```source /dataquest/system/env/python3/bin/activate```


In [18]:
import pandas as pd



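if __name__ == "__main__":
    print("program successfully executed")

    # Companion "content" file shipped alongside the school-level data.
    contents = pd.read_csv("crdc201314csv/CRDC2013_14_SCH_content.csv")
    # print(contents.head(15))

    # School-level data, read as Latin-1 instead of pandas' default UTF-8.
    data = pd.read_csv("crdc201314csv/CRDC2013_14_SCH.csv", encoding="Latin-1")

    # Count the rows for each value of the JJ and SCH_STATUS_MAGNET columns.
    print(data["JJ"].value_counts())
    print(data["SCH_STATUS_MAGNET"].value_counts())

    # Sum male and female enrollment within each JJ group.
    print(pd.pivot_table(data, values=["TOT_ENR_M", "TOT_ENR_F"], index="JJ", aggfunc="sum"))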


program successfully executed




NO     94874
YES      633
Name: JJ, dtype: int64
NO     91743
YES     3749
-5        15
Name: SCH_STATUS_MAGNET, dtype: int64
     TOT_ENR_F  TOT_ENR_M
JJ                       
NO    24317962   25677023
YES       5791      34890
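
The output above comes from two pandas operations: `value_counts()` tallies how many rows hold each value in a column, and `pd.pivot_table()` with `aggfunc="sum"` adds up the chosen columns within each group. As a rough sketch of how they behave, here is the same pattern on a tiny DataFrame. The column names mirror the ones used above, but the rows are invented and do not come from the CRDC data.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the CRDC files.
toy = pd.DataFrame({
    "JJ": ["NO", "NO", "YES", "NO"],
    "TOT_ENR_M": [100, 200, 30, 50],
    "TOT_ENR_F": [110, 190, 5, 60],
})

# Count how many rows fall into each JJ category.
jj_counts = toy["JJ"].value_counts()

# Sum both enrollment columns within each JJ group.
enrollment = pd.pivot_table(
    toy, values=["TOT_ENR_M", "TOT_ENR_F"], index="JJ", aggfunc="sum"
)

print(jj_counts)
print(enrollment)
```

Here the three `NO` rows are collapsed into a single row of the pivot table holding their summed enrollments, which is the same shape as the two-row result shown for the full data set.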
