# Reading Files with Pandas

In this section we will use Pandas to explore tabular datasets, mainly in CSV form.

## Input Data

Our input data consists of data files published by the New York Times about the spread of Covid-19 across the United States. You can download the files from their GitHub repository: https://github.com/nytimes/covid-19-data

As a starter, we will look at the file facilities.csv, which contains details of the Covid-19 spread in correctional facilities. During the pandemic, outbreaks were common in confined facilities such as prisons, care homes, and nursing homes.

## Reading a CSV File In Pandas

To read a CSV file we use the read_csv() function from Pandas. read_csv() expects a file path and also accepts many optional parameters that modify how the file is read.
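To give a feel for those optional parameters, here is a minimal sketch that reads a small, made-up CSV string instead of the real facilities.csv. The rows are invented for illustration; usecols, dtype, and nrows are real read_csv() parameters, but the values chosen here are only an example.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file; io.StringIO lets read_csv
# treat the string as if it were a file on disk.
csv_text = """nyt_id,facility_name,facility_state,total_inmate_cases
A1,Example Prison,Ohio,120
B2,Sample Jail,Texas,45
C3,Demo Facility,Ohio,300
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["nyt_id", "facility_state", "total_inmate_cases"],  # keep only these columns
    dtype={"total_inmate_cases": "int64"},  # set this column's type explicitly
    nrows=2,  # read only the first two data rows
)
print(sample)
```

Limiting the columns and rows like this can be handy when you only want a quick look at a large file.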

In [1]:
import pandas as pd

If you want to see the details of the read_csv() function, you can do the following:

In [2]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    sep=<no_default>,
    delimiter=None,
    header='infer',
    names=<no_default>,
    index_col=None,
    usecols=None,
    squeeze=None,
    prefix=<no_default>,
    mangle_dupe_cols=True,
    dtype: 'DtypeArg | None'
    ...

In [3]:
prisonData = pd.read_csv('data/facilities.csv')  # We are passing just the file path here

Quick question: what is the datatype of prisonData?

If you just want to see the first few or last few records, you can use the head() and tail() methods, respectively.

In [4]:
prisonData.head()

Unnamed: 0,nyt_id,facility_name,facility_type,facility_city,facility_county,facility_county_fips,facility_state,facility_lng,facility_lat,latest_inmate_population,max_inmate_population_2020,total_inmate_cases,total_inmate_deaths,total_officer_cases,total_officer_deaths,note
0,F3EFE858,Alex City Work Release prison,Low-security work release,Alex City,Coosa,1037,Alabama,-86.009015,32.904507,188.0,,77,0,17,0.0,
1,5B910220,Alabama Therapeutic Education Facility prison,State rehabilitation center,Columbiana,Shelby,1117,Alabama,-86.624067,33.180755,272.0,,11,1,2,0.0,
2,02FB1675,Bibb Correctional Facility,State prison,Brent,Bibb,1007,Alabama,-87.162781,32.920754,1725.0,1825.0,164,3,61,0.0,
3,6378F6C4,Birmingham Women's Community Based Facility an...,State prison,Birmingham,Jefferson,1073,Alabama,-86.808344,33.531101,192.0,,17,0,28,0.0,
4,EAABF900,Bullock Correctional Facility,State prison,Bessemer,Bullock,1011,Alabama,-85.673927,32.147144,1477.0,1577.0,162,5,80,1.0,


In [5]:
prisonData.tail()

Unnamed: 0,nyt_id,facility_name,facility_type,facility_city,facility_county,facility_county_fips,facility_state,facility_lng,facility_lat,latest_inmate_population,max_inmate_population_2020,total_inmate_cases,total_inmate_deaths,total_officer_cases,total_officer_deaths,note
2634,15289545,North Lake federal prison,Federal prison,Baldwin,Lake,26085,Michigan,-85.839287,43.928551,1614.0,1614.0,125,2,0,0.0,
2635,1558C2BF,Rivers federal prison,Federal prison,Winton,Hertford,37091,North Carolina,-76.958751,36.403668,6.0,1255.0,68,1,0,0.0,
2636,C9CF62B9,Reeves County federal prison,Federal prison,Pecos,Reeves,48389,Texas,-103.493817,31.423563,1005.0,,46,1,0,0.0,
2637,364869B9,Flightline federal prison,Federal prison,Big Spring,Howard,48227,Texas,-101.521236,32.22431,1652.0,,34,1,0,0.0,
2638,5CFB7978,Moshannon Valley federal prison,Federal prison,Philipsburg,Clearfield,42033,Pennsylvania,-78.242177,40.921117,94.0,1774.0,197,4,0,0.0,


How many rows and columns are there in our prison dataset?

Can you list the columns and the index of this dataset?

What is the datatype of each column?

Can you select the column 'total_inmate_cases' from the dataset? What is the datatype of the column?

Can you select the columns 'nyt_id', 'facility_name', 'facility_state', 'latest_inmate_population' and 'total_inmate_cases' from the dataset?

Here are some basic descriptive statistics for the dataset:
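As a hint for the questions above, here is a minimal sketch on a tiny made-up DataFrame (the rows are invented and only mimic a few columns of facilities.csv). It shows one possible way to inspect the size, labels, and types of a DataFrame, and how selecting one column differs from selecting several.

```python
import pandas as pd

# Made-up rows that mimic a few columns of facilities.csv.
df = pd.DataFrame({
    "nyt_id": ["A1", "B2", "C3"],
    "facility_state": ["Ohio", "Texas", "Ohio"],
    "latest_inmate_population": [500.0, None, 1200.0],
    "total_inmate_cases": [120, 45, 300],
})

print(type(df))     # the kind of object that read_csv also returns
print(df.shape)     # (number of rows, number of columns)
print(df.columns)   # column labels
print(df.index)     # row labels
print(df.dtypes)    # one dtype per column

# Single brackets with one name give a Series.
cases = df["total_inmate_cases"]
# A list of names inside the brackets gives a DataFrame.
subset = df[["nyt_id", "facility_state", "total_inmate_cases"]]
print(cases.dtype)
```

The same attributes and selections should work unchanged on prisonData.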

In [6]:
prisonData.describe()

Unnamed: 0,facility_county_fips,facility_lng,facility_lat,latest_inmate_population,max_inmate_population_2020,total_inmate_cases,total_inmate_deaths,total_officer_cases,total_officer_deaths
count,2639.0,2639.0,2639.0,1593.0,838.0,2639.0,2639.0,2639.0,2638.0
mean,29186.422887,-91.193161,37.280626,822.386692,1269.529833,199.431603,0.983327,42.561955,0.083776
std,15809.43962,14.836097,5.437975,862.009973,936.057182,420.344387,2.982973,89.408213,0.363528
min,1003.0,-165.412569,18.423135,1.0,27.0,0.0,0.0,0.0,0.0
25%,13233.0,-97.649683,33.422542,163.0,602.5,8.0,0.0,0.0,0.0
50%,30017.0,-86.812013,37.606861,579.0,1066.5,59.0,0.0,9.0,0.0
75%,42033.0,-81.115561,40.93327,1197.0,1681.0,226.5,0.0,50.5,0.0
max,72061.0,-66.112209,64.833337,8616.0,6150.0,12290.0,45.0,1718.0,6.0


Running descriptive statistics on columns such as facility_county_fips doesn't make sense, because the values are identifiers rather than quantities. For columns such as total_inmate_cases, however, describe() gives a quick overview (mean = 199.431603, max = 12290.0).

Can you print the descriptive statistics for just total_inmate_cases and total_inmate_deaths?
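One possible approach is to select the columns of interest first and then call describe() on the result. The sketch below uses a small made-up DataFrame, so the numbers are not those of the real dataset.

```python
import pandas as pd

# Made-up values; facility_county_fips is included only to show it being left out.
df = pd.DataFrame({
    "facility_county_fips": [39049, 48113, 39035, 6019],
    "total_inmate_cases": [120, 45, 300, 15],
    "total_inmate_deaths": [1, 0, 4, 0],
})

# Select the two numeric columns of interest, then summarise only those.
stats = df[["total_inmate_cases", "total_inmate_deaths"]].describe()
print(stats)

# Individual statistics can be read from the result with loc[row, column].
print(stats.loc["mean", "total_inmate_cases"])
```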

## Some basic calculations

Create a new field cases_per_population, calculated as total_inmate_cases / latest_inmate_population.

Create a new field deaths_per_population, calculated as total_inmate_deaths / latest_inmate_population.

Find the correctional facility with the maximum number of inmate cases.

Find the correctional facility with the maximum number of inmate deaths.
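The calculations above could be sketched as follows, again on a made-up DataFrame with invented facility names and values. Column arithmetic in Pandas is element-wise, and idxmax() is one of several ways to locate the row holding the largest value.

```python
import pandas as pd

# Made-up rows; one population value is missing on purpose.
df = pd.DataFrame({
    "facility_name": ["Example Prison", "Sample Jail", "Demo Facility"],
    "latest_inmate_population": [500.0, None, 1200.0],
    "total_inmate_cases": [120, 45, 300],
    "total_inmate_deaths": [1, 0, 4],
})

# Element-wise division; rows with a missing population yield NaN.
df["cases_per_population"] = df["total_inmate_cases"] / df["latest_inmate_population"]
df["deaths_per_population"] = df["total_inmate_deaths"] / df["latest_inmate_population"]

# idxmax returns the index label of the largest value; loc fetches that row.
worst_cases = df.loc[df["total_inmate_cases"].idxmax()]
worst_deaths = df.loc[df["total_inmate_deaths"].idxmax()]
print(worst_cases["facility_name"])
print(worst_deaths["facility_name"])
```

Note that the missing population propagates into the new columns as NaN, which is worth keeping in mind before averaging them.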

## Selections

Can you select the first 10 rows of this dataset?

Can you select the last 100 rows of total_inmate_cases?

Can you select the last 100 rows of total_inmate_cases and total_inmate_deaths?

Can you select the last 100 rows of total_inmate_cases and calculate their mean?
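Here is a minimal sketch of these kinds of selections on a six-row made-up DataFrame, using 2 and 3 rows in place of 10 and 100. head(), tail(), and iloc are standard Pandas tools; the data is invented.

```python
import pandas as pd

# Made-up values, small enough to check by eye.
df = pd.DataFrame({
    "total_inmate_cases": [10, 20, 30, 40, 50, 60],
    "total_inmate_deaths": [0, 1, 0, 2, 1, 3],
})

first_two = df.head(2)                 # first rows; df.iloc[:2] selects the same rows
last_three_cases = df["total_inmate_cases"].tail(3)   # last rows of one column (a Series)
last_three_both = df[["total_inmate_cases", "total_inmate_deaths"]].tail(3)
mean_last_three = last_three_cases.mean()              # mean of 40, 50 and 60

print(first_two)
print(last_three_both)
print(mean_last_three)
```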

## Filtering and updating data

Let us find all facilities with more than 1000 inmate cases:

In [7]:
prisonData[prisonData.total_inmate_cases>1000]

Unnamed: 0,nyt_id,facility_name,facility_type,facility_city,facility_county,facility_county_fips,facility_state,facility_lng,facility_lat,latest_inmate_population,max_inmate_population_2020,total_inmate_cases,total_inmate_deaths,total_officer_cases,total_officer_deaths,note
34,7120943C,Goose Creek Correctional Center,State prison,Wasilla,Matanuska-Susitna,2170,Alaska,-149.990022,61.360095,1293.0,1348.0,1041,0,0,0.0,
50,AA807759,Arizona State Prison Complex – Douglas,State prison,Douglas,Cochise,4003,Arizona,-109.600147,31.445500,1848.0,2132.0,1163,0,0,0.0,
51,91B22ADD,Eyman Complex prison,State prison,Florence,Pinal,4021,Arizona,-111.337569,33.031474,5362.0,5471.0,2023,5,0,0.0,
52,0CF96DB1,Arizona State Prison Complex – Yuma,State prison,San Luis,Yuma,4027,Arizona,-114.643082,32.494131,4316.0,4862.0,2010,6,0,0.0,
58,E830C278,Arizona State Prison Complex – Lewis,State prison,Buckeye,Maricopa,4013,Arizona,-112.644365,33.201281,4617.0,4617.0,1310,2,0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2568,233CC887,Pollock federal prison complex,Federal prison,Pollock,Grant,22043,Louisiana,-92.432853,31.465159,2590.0,2754.0,1211,0,0,0.0,
2588,79BEB932,Seagoville federal prison,Federal prison,Seagoville,Dallas,48113,Texas,-96.567986,32.655933,1798.0,2004.0,1336,5,0,0.0,
2599,4A388A9A,Terre Haute federal prison complex,Federal prison,Terre Haute,Vigo,18167,Indiana,-87.455019,39.410206,2291.0,2624.0,1248,6,0,0.0,
2608,D9A37098,Tucson federal prison complex,Federal prison,Tucson,Pima,4019,Arizona,-110.865276,32.084730,1586.0,1931.0,1067,10,0,0.0,


Now can you create a subset of our dataset containing only facilities from Ohio?

Can you find the facilities in Franklin County, Ohio with more than 30 inmate cases?

Which facility in Ohio has the highest number of inmate deaths?
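When several conditions have to hold at once, each comparison produces a boolean Series and the Series can be combined with & (and) or | (or). Each condition needs its own parentheses because & binds more tightly than the comparison operators. A minimal sketch on made-up rows with placeholder facility names:

```python
import pandas as pd

# Made-up rows; facility names A to D are placeholders.
df = pd.DataFrame({
    "facility_name": ["A", "B", "C", "D"],
    "facility_county": ["Franklin", "Franklin", "Cuyahoga", "Franklin"],
    "facility_state": ["Ohio", "Ohio", "Ohio", "Kentucky"],
    "total_inmate_cases": [12, 85, 40, 90],
    "total_inmate_deaths": [0, 2, 5, 1],
})

# A single condition: keep rows where the state is Ohio.
ohio = df[df["facility_state"] == "Ohio"]

# Several conditions combined with &, each wrapped in parentheses.
franklin_ohio = df[
    (df["facility_state"] == "Ohio")
    & (df["facility_county"] == "Franklin")
    & (df["total_inmate_cases"] > 30)
]

# Within the Ohio subset, look up the name on the row with the most deaths.
most_deaths = ohio.loc[ohio["total_inmate_deaths"].idxmax(), "facility_name"]

print(franklin_ohio)
print(most_deaths)
```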

## Dealing with missing data

Data gaps are common in real-world datasets. Let us see the records with missing latest_inmate_population.

In [8]:
prisonData[prisonData['latest_inmate_population'].isna()]

Unnamed: 0,nyt_id,facility_name,facility_type,facility_city,facility_county,facility_county_fips,facility_state,facility_lng,facility_lat,latest_inmate_population,max_inmate_population_2020,total_inmate_cases,total_inmate_deaths,total_officer_cases,total_officer_deaths,note
39,ED29F8B4,Palmer Correctional Center,State prison,Palmer,Matanuska-Susitna,2170,Alaska,-149.007972,61.691417,,,0,0,0,0.0,
57,07E5D203,Maricopa Reentry Center,State prison,Phoenix,Maricopa,4013,Arizona,-112.120433,33.707162,,,0,0,0,0.0,
61,26CFE1E6,Pima Reentry Center,State prison,Tucson,Pima,4019,Arizona,-110.990714,32.204199,,,0,0,0,0.0,
79,9AA13461,Northwest Arkansas Work Release prison,Low-security work release,Springdale,Washington,5143,Arkansas,-94.135712,36.174948,,,69,0,4,0.0,
88,7C68181A,Arkansas Juvenile Assessment and Treatment Center,State juvenile detention,Alexander,Saline,5125,Arkansas,-92.494695,34.637731,,,63,0,67,4.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2620,786CB2A3,Roseville federal prison halfway house,Federal halfway house,Roseville,Ramsey,27123,Minnesota,-93.124231,44.994233,,,8,0,0,0.0,
2622,175CAB9F,Watkinson House federal prison halfway house,Federal halfway house,Hartford,Hartford,9003,Connecticut,-72.689572,41.772700,,,1,0,0,0.0,
2624,87DBAF80,Parsons House federal prison halfway house,Federal halfway house,Milwaukee,Milwaukee,55079,Wisconsin,-87.960541,43.039124,,,1,0,0,0.0,
2625,E3A87C51,Flagstaff federal prison halfway house,Federal halfway house,Flagstaff,Coconino,4005,Arizona,-111.640458,35.187256,,,1,0,0,0.0,


That's a lot of records.

Now can you keep only the records that have a non-null value for latest_inmate_population?
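One way to do this is with notna(), the counterpart of isna(); dropna() with the subset argument gives the same rows. The sketch below uses made-up data to show that both routes agree.

```python
import pandas as pd

# Made-up rows; the middle one has no population recorded.
df = pd.DataFrame({
    "facility_name": ["Example Prison", "Sample Jail", "Demo Facility"],
    "latest_inmate_population": [500.0, float("nan"), 1200.0],
})

# Count the missing values in the column.
missing_count = df["latest_inmate_population"].isna().sum()

# Keep only rows where the population is present.
with_population = df[df["latest_inmate_population"].notna()]

# dropna restricted to one column selects the same rows.
same_result = df.dropna(subset=["latest_inmate_population"])

print(missing_count)
print(with_population)
```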

## Data types

Let us see the datatype of the column facility_county_fips:

In [9]:
prisonData['facility_county_fips'].dtype

dtype('int64')

Can you convert this column to string?
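A minimal sketch of the conversion, using three FIPS values that appear in the outputs above. astype(str) does the conversion itself; the zero-padding step is an extra suggestion, based on county FIPS codes being five digits long, and is not something this section requires.

```python
import pandas as pd

# Three county FIPS values taken from the outputs shown earlier.
fips = pd.Series([1037, 2170, 39049], name="facility_county_fips")

# Convert the integers to strings.
fips_str = fips.astype(str)

# Optional extra step: restore the leading zero lost when the codes were read as integers.
fips_padded = fips_str.str.zfill(5)

print(list(fips_str))
print(list(fips_padded))
```

On the full dataset, the result would be assigned back to the column, for example prisonData['facility_county_fips'] = prisonData['facility_county_fips'].astype(str).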

## Finding Unique Values

We can use the unique() method to extract the unique values from a column. For example, let's extract the unique values from the facility_state column.

In [10]:
prisonData['facility_state'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
       'Hawaii', 'Idaho', 'Texas', 'Illinois', 'Indiana', 'Iowa',
       'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia',
       'Wisconsin', 'Wyoming', 'District of Columbia', 'kentucky', nan,
       'Puerto Rico'], dtype=object)
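Notice that the output above contains both 'Kentucky' and 'kentucky', as well as nan. The following sketch, on a short made-up Series, shows one possible way to spot and tidy such inconsistencies; the replacement mapping is an illustrative choice, not part of the original dataset's documentation.

```python
import pandas as pd

# Made-up values reproducing the kind of inconsistency seen above.
states = pd.Series(["Ohio", "Kentucky", "kentucky", None, "Ohio"], name="facility_state")

print(states.unique())      # distinct values, including the missing one
n_before = states.nunique() # number of distinct values; missing values are not counted

# Map the lower-case spelling onto the capitalised one.
cleaned = states.replace({"kentucky": "Kentucky"})
n_after = cleaned.nunique()

# value_counts shows how often each value occurs; dropna=False also counts missing values.
counts = cleaned.value_counts(dropna=False)
print(counts)
```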

## Sorting data

DataFrames and Series can be sorted using the sort_values() method. Let us sort our dataset by total_inmate_cases in descending order.

In [11]:
prisonData.sort_values(by='total_inmate_cases',ascending=False)

Unnamed: 0,nyt_id,facility_name,facility_type,facility_city,facility_county,facility_county_fips,facility_state,facility_lng,facility_lat,latest_inmate_population,max_inmate_population_2020,total_inmate_cases,total_inmate_deaths,total_officer_cases,total_officer_deaths,note
2247,E5DA4853,U.S. Marshalls,U.S. Marshalls,,Muskogee,40101,,-95.374390,35.749979,,,12290,31,0,0.0,
1458,046ADCE7,Fresno County jail,Jail,Fresno,Fresno,6019,California,-119.789751,36.738211,2042.0,,3814,0,171,1.0,
92,41367DB9,Avenal State Prison,State prison,Avenal,Kings,6031,California,-120.123310,35.974332,3469.0,4158.0,3108,8,532,0.0,
120,19DE7187,Substance Abuse Treatment Facility and State P...,State prison,Corcoran,Kings,6031,California,-119.554854,36.058048,4488.0,4531.0,3011,7,636,0.0,
106,550543A1,Soledad prison,State prison,Soledad,Monterey,6053,California,-121.382640,36.470807,4374.0,,2719,18,355,2.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2374,0A1D2EB9,Carver County Jail,Detention center,Chaska,Carver,27019,Minnesota,-93.592635,44.787408,,,0,0,0,0.0,
1924,C75A68C0,Vance County Detention Center,Jail,Henderson,Vance,37181,North Carolina,-78.407240,36.330646,,,0,0,3,0.0,
2372,3867C0A8,Sheridan federal prison,Detention center,Sheridan,Yamhill,41071,Washington,-123.381623,45.084014,,,0,0,0,0.0,
344,5E4670FA,Melbourne Center for Personal Growth youth center,State juvenile detention,Melbourne,Brevard,12009,Florida,-80.713169,28.122853,12.0,,0,0,6,0.0,


As you can see, the U.S. Marshalls facility has the highest number of total_inmate_cases, followed by Fresno County jail.

Can you sort the dataset by total_inmate_deaths in descending order?
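sort_values() also accepts a list of columns, which is useful for breaking ties, and it returns a new sorted object rather than changing the original. A minimal sketch on made-up rows with placeholder facility names:

```python
import pandas as pd

# Made-up rows; B and D tie on total_inmate_cases.
df = pd.DataFrame({
    "facility_name": ["A", "B", "C", "D"],
    "total_inmate_cases": [120, 300, 45, 300],
    "total_inmate_deaths": [7, 2, 0, 5],
})

# Sort by a single column, largest first.
by_deaths = df.sort_values(by="total_inmate_deaths", ascending=False)

# Sort by cases first, then use deaths to order rows with equal cases.
by_cases_then_deaths = df.sort_values(
    by=["total_inmate_cases", "total_inmate_deaths"],
    ascending=[False, False],
)

print(by_deaths)
print(by_cases_then_deaths)
print(df)  # the original order is unchanged
```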

## Writing data to a file

We can use the to_csv() method to write a DataFrame to a CSV file. Let's write a subset of our dataset, containing all the facilities with 1,000 or more inmate cases, to a file called facilities_1000ormore.csv in the data folder.

In [12]:
# Let's create the subset
prisonData1000OrMore = prisonData[prisonData['total_inmate_cases']>=1000]
prisonData1000OrMore

Unnamed: 0,nyt_id,facility_name,facility_type,facility_city,facility_county,facility_county_fips,facility_state,facility_lng,facility_lat,latest_inmate_population,max_inmate_population_2020,total_inmate_cases,total_inmate_deaths,total_officer_cases,total_officer_deaths,note
34,7120943C,Goose Creek Correctional Center,State prison,Wasilla,Matanuska-Susitna,2170,Alaska,-149.990022,61.360095,1293.0,1348.0,1041,0,0,0.0,
50,AA807759,Arizona State Prison Complex – Douglas,State prison,Douglas,Cochise,4003,Arizona,-109.600147,31.445500,1848.0,2132.0,1163,0,0,0.0,
51,91B22ADD,Eyman Complex prison,State prison,Florence,Pinal,4021,Arizona,-111.337569,33.031474,5362.0,5471.0,2023,5,0,0.0,
52,0CF96DB1,Arizona State Prison Complex – Yuma,State prison,San Luis,Yuma,4027,Arizona,-114.643082,32.494131,4316.0,4862.0,2010,6,0,0.0,
58,E830C278,Arizona State Prison Complex – Lewis,State prison,Buckeye,Maricopa,4013,Arizona,-112.644365,33.201281,4617.0,4617.0,1310,2,0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2568,233CC887,Pollock federal prison complex,Federal prison,Pollock,Grant,22043,Louisiana,-92.432853,31.465159,2590.0,2754.0,1211,0,0,0.0,
2588,79BEB932,Seagoville federal prison,Federal prison,Seagoville,Dallas,48113,Texas,-96.567986,32.655933,1798.0,2004.0,1336,5,0,0.0,
2599,4A388A9A,Terre Haute federal prison complex,Federal prison,Terre Haute,Vigo,18167,Indiana,-87.455019,39.410206,2291.0,2624.0,1248,6,0,0.0,
2608,D9A37098,Tucson federal prison complex,Federal prison,Tucson,Pima,4019,Arizona,-110.865276,32.084730,1586.0,1931.0,1067,10,0,0.0,


In [13]:
prisonData1000OrMore.to_csv(r'data/facilities_1000ormore.csv',index=False)
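To check what was written, the file can be read straight back in. The sketch below does this round trip with made-up data and a temporary directory (both are assumptions for illustration, so nothing is written to the data folder). Passing index=False keeps the row labels out of the file; without it, reading the file back would produce an extra column named 'Unnamed: 0'.

```python
import os
import tempfile

import pandas as pd

# Made-up rows; two of them meet the 1,000-case threshold.
df = pd.DataFrame({
    "nyt_id": ["A1", "B2", "C3"],
    "total_inmate_cases": [1200, 45, 1000],
})
subset = df[df["total_inmate_cases"] >= 1000]

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "facilities_1000ormore.csv")
    subset.to_csv(out_path, index=False)   # do not write the row labels
    round_trip = pd.read_csv(out_path)     # read the file back to inspect it

print(round_trip)
```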

In the next chapter we will look into more advanced Pandas concepts such as grouping and joining.