# Data Science Lesson - Getting Data
---
## Reading from CSV files
A comma-separated values (CSV) file is a very common, generic file format used for data storage and transfer.
There is the "vanilla" Python way to get data out of a CSV, and there is the pandas way.
See https://realpython.com/python-csv/ to determine which you prefer. :)
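
For comparison with the pandas approach used below, here is a minimal sketch of the "vanilla" way using the standard library's `csv` module. The two sample rows are copied from the people data in this lesson, and `io.StringIO` stands in for an open file so the example runs without anything on disk; the variable names are just for illustration.

```python
import csv
import io

# Two rows in the same shape as people.csv; io.StringIO behaves like an
# open text file, so no file on disk is needed for this sketch.
sample = io.StringIO(
    "name,gpa,friends,height,sport,footwear\n"
    "dakota,3.15,307,72,basketball,sneakers\n"
    "hayden,3.10,335,68,tennis,flip-flops\n"
)

# DictReader uses the header row as keys; every value comes back as a string.
rows = list(csv.DictReader(sample))

# In the vanilla approach, type conversion is done by hand.
gpas = [float(row["gpa"]) for row in rows]

print(rows[0]["name"], gpas)
```

With a real file you would pass the result of `open("people.csv", newline="")` to `csv.DictReader` instead of the `StringIO` object.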

Open your Google Sheets file of people data from the last lesson. Use File > Download to get a CSV locally and place it in the same directory as this notebook. Rename it "people.csv".

Now, import pandas as pd and use the .read_csv function to read the contents of people.csv into a pandas dataframe. Output the dataframe to see what it looks like.

In [13]:
import pandas as pd

content = pd.read_csv('/workspaces/data-science-2024F/unit2/people.csv')
print(content)

      name   gpa  friends  height       sport    footwear
0   dakota  3.15      307      72  basketball    sneakers
1   hayden  3.10      335      68      tennis  flip-flops
2  charlie  1.10       34      61    baseball  flip-flops
3   kamryn  2.18      200      66      soccer    sneakers
4  emerson  3.06      213      65      soccer    sneakers
5   jessie  2.41      202      61  basketball  flip-flops
6   sawyer  2.96      314      67      tennis  flip-flops
7   london  3.98      436      64      soccer    sneakers


**Question:** What's a DataFrame?

**Answer:** It's our new best friend!
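
More literally, a DataFrame is a table: labeled columns, each holding one type of data, plus an index that labels the rows. A minimal sketch, building a tiny one by hand from two rows of the people data (the variable name `people` is only for illustration):

```python
import pandas as pd

# Two rows copied from the people data, written as a dict of columns.
people = pd.DataFrame(
    {
        "name": ["dakota", "hayden"],
        "gpa": [3.15, 3.10],
        "height": [72, 68],
    }
)

print(people.shape)   # (number of rows, number of columns)
print(people["gpa"])  # one column, returned as a pandas Series
print(people.loc[0])  # one row, looked up by its index label
```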

Now that you have data in a pandas dataframe, use .info() to get details about the columns.

In [14]:
content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      8 non-null      object 
 1   gpa       8 non-null      float64
 2   friends   8 non-null      int64  
 3   height    8 non-null      int64  
 4   sport     8 non-null      object 
 5   footwear  8 non-null      object 
dtypes: float64(1), int64(2), object(3)
memory usage: 516.0+ bytes


And now use .describe() to get summary statistics for the numeric columns.

In [15]:
content.describe()

| | gpa | friends | height |
|---|---|---|---|
| count | 8.0 | 8.0 | 8.0 |
| mean | 2.7425 | 255.125 | 65.5 |
| std | 0.853409 | 120.58481 | 3.664502 |
| min | 1.1 | 34.0 | 61.0 |
| 25% | 2.3525 | 201.5 | 63.25 |
| 50% | 3.01 | 260.0 | 65.5 |
| 75% | 3.1125 | 319.25 | 67.25 |
| max | 3.98 | 436.0 | 72.0 |
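
Notice that the summary covers only gpa, friends, and height: by default, .describe() skips text columns such as name, sport, and footwear. A minimal sketch of how to summarize the text columns instead, using the `exclude` parameter of .describe() on a four-row subset of the people data (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

# Four rows copied from the people data, with a subset of the columns.
people = pd.DataFrame(
    {
        "name": ["dakota", "hayden", "charlie", "kamryn"],
        "gpa": [3.15, 3.10, 1.10, 2.18],
        "sport": ["basketball", "tennis", "baseball", "soccer"],
        "footwear": ["sneakers", "flip-flops", "flip-flops", "sneakers"],
    }
)

# Default behavior: numeric columns only.
numeric_summary = people.describe()

# Excluding numeric types leaves the text columns, which are summarized
# with count, unique, top (most common value), and freq.
text_summary = people.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```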


## Weather Data
From https://github.com/fivethirtyeight/data/tree/master/us-weather-history we can get weather data as CSV files from many different airports.

Download a CSV file from the above site. (Make sure to pick one that no one else chooses.) You'll need to view the "raw" page and save the file locally as a .csv (not .txt).

Then, read the file into a dataframe and output it to verify.

In [17]:
weather = pd.read_csv('/workspaces/data-science-2024F/unit2/KNYC.csv')
print(weather)

          date  actual_mean_temp  actual_min_temp  actual_max_temp  \
0     2014-7-1                81               72               89   
1     2014-7-2                82               72               91   
2     2014-7-3                78               69               87   
3     2014-7-4                70               65               74   
4     2014-7-5                72               63               81   
..         ...               ...              ...              ...   
360  2015-6-26                75               69               81   
361  2015-6-27                65               58               71   
362  2015-6-28                68               62               73   
363  2015-6-29                70               63               76   
364  2015-6-30                75               68               82   

     average_min_temp  average_max_temp  record_min_temp  record_max_temp  \
0                  68                83               52              100   
1    

Get the info for the data set.

In [18]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   365 non-null    object 
 1   actual_mean_temp       365 non-null    int64  
 2   actual_min_temp        365 non-null    int64  
 3   actual_max_temp        365 non-null    int64  
 4   average_min_temp       365 non-null    int64  
 5   average_max_temp       365 non-null    int64  
 6   record_min_temp        365 non-null    int64  
 7   record_max_temp        365 non-null    int64  
 8   record_min_temp_year   365 non-null    int64  
 9   record_max_temp_year   365 non-null    int64  
 10  actual_precipitation   365 non-null    float64
 11  average_precipitation  365 non-null    float64
 12  record_precipitation   365 non-null    float64
dtypes: float64(3), int64(9), object(1)
memory usage: 37.2+ KB


And now describe the data.

In [19]:
weather.describe()

| | actual_mean_temp | actual_min_temp | actual_max_temp | average_min_temp | average_max_temp | record_min_temp | record_max_temp | record_min_temp_year | record_max_temp_year | actual_precipitation | average_precipitation | record_precipitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 |
| mean | 54.736986 | 47.246575 | 61.734247 | 48.016438 | 62.079452 | 28.243836 | 83.731507 | 1924.079452 | 1958.923288 | 0.126164 | 0.136822 | 2.386137 |
| std | 18.679979 | 18.277156 | 19.446971 | 14.749176 | 16.068765 | 20.729107 | 13.351306 | 38.00082 | 34.824357 | 0.325577 | 0.015734 | 1.045702 |
| min | 11.0 | 2.0 | 19.0 | 27.0 | 38.0 | -15.0 | 54.0 | 1871.0 | 1876.0 | 0.0 | 0.1 | 0.86 |
| 25% | 39.0 | 34.0 | 44.0 | 34.0 | 47.0 | 8.0 | 71.0 | 1888.0 | 1933.0 | 0.0 | 0.13 | 1.69 |
| 50% | 58.0 | 50.0 | 65.0 | 48.0 | 63.0 | 31.0 | 87.0 | 1920.0 | 1962.0 | 0.0 | 0.14 | 2.16 |
| 75% | 72.0 | 64.0 | 80.0 | 63.0 | 78.0 | 47.0 | 96.0 | 1954.0 | 1990.0 | 0.05 | 0.15 | 2.75 |
| max | 85.0 | 77.0 | 92.0 | 69.0 | 84.0 | 59.0 | 106.0 | 2015.0 | 2013.0 | 2.54 | 0.17 | 8.28 |


Take a look at the mean and std of actual_mean_temp. The mean is the average daily temperature at the airport over the whole year; the std (standard deviation) measures how much the daily temperatures vary around that average. Compare with others to see whose airports are more temperate and whose are more volatile (a larger std means bigger swings). Then look up your airport by its code and see if your observations make sense.
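
To make the comparison concrete, here is a minimal sketch that pulls out just those two numbers. The helper `temp_summary` is hypothetical, and the two small tables are made-up toy values (not real airport data) chosen so that both sites share a mean but differ in spread.

```python
import pandas as pd


def temp_summary(weather: pd.DataFrame) -> tuple[float, float]:
    """Return (mean, standard deviation) of the actual_mean_temp column."""
    temps = weather["actual_mean_temp"]
    return temps.mean(), temps.std()


# Made-up toy values, NOT real airport data: one steady site, one volatile site.
steady = pd.DataFrame({"actual_mean_temp": [68, 70, 72, 70, 70]})
volatile = pd.DataFrame({"actual_mean_temp": [30, 50, 70, 90, 110]})

for label, frame in [("steady", steady), ("volatile", volatile)]:
    mean, std = temp_summary(frame)
    print(f"{label}: mean={mean:.1f}, std={std:.1f}")
```

Both toy sites average 70 degrees, so the mean alone cannot tell them apart; the std is what separates the steady site from the volatile one. With real data you would pass in the dataframes you and a classmate loaded with `pd.read_csv`.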

## Reading directly from the web
We can also read a CSV directly from the web without saving the file locally. Note that this creates a dependency on the host of the data: if that resource is moved (or removed), our script will stop functioning.
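
One way to soften that dependency is to keep a local copy and fall back to it when the web read fails. The following is only a sketch under assumptions: the function `read_csv_with_fallback` and its parameter names are hypothetical, and the demo uses a missing local path in place of an unreachable URL so that it runs offline.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(source: str, fallback_path: str) -> pd.DataFrame:
    """Try the primary source first; use a local copy if that read fails.

    Hypothetical helper for illustration; both parameter names are made up.
    """
    try:
        return pd.read_csv(source)
    except OSError:
        # URLError, HTTPError, and FileNotFoundError are all subclasses
        # of OSError, so this covers unreachable hosts and missing files.
        return pd.read_csv(fallback_path)


# Offline demo: a path that does not exist stands in for a dead URL.
with tempfile.TemporaryDirectory() as tmp:
    local_copy = os.path.join(tmp, "backup.csv")
    with open(local_copy, "w") as f:
        f.write("name,rank\nSMITH,1\n")

    missing = os.path.join(tmp, "does_not_exist.csv")
    backup = read_csv_with_fallback(missing, local_copy)

print(backup)
```

The trade-off is that the local copy can go stale, so this suits data that rarely changes.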

Try getting data on surnames from here: https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv

Then, investigate the data using the pandas tools we just learned.

In [21]:
surnames_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv"
surnames = pd.read_csv(surnames_url)

print(surnames)
surnames.info()
surnames.describe()

             name    rank    count  prop100k  cum_prop100k pctwhite pctblack  \
0           SMITH       1  2376206    880.85        880.85    73.35    22.22   
1         JOHNSON       2  1857160    688.44       1569.30    61.55     33.8   
2        WILLIAMS       3  1534042    568.66       2137.96    48.52    46.72   
3           BROWN       4  1380145    511.62       2649.58    60.71    34.54   
4           JONES       5  1362755    505.17       3154.75    57.69    37.73   
...           ...     ...      ...       ...           ...      ...      ...   
151666     YOUSKO  150436      100      0.04      89752.93       99      (S)   
151667    ZAITSEV  150436      100      0.04      89753.04       92      (S)   
151668      ZALLA  150436      100      0.04      89753.11       99      (S)   
151669     ZERBEY  150436      100      0.04      89753.30       99      (S)   
151670  ZITTERICH  150436      100      0.04      89753.48       98      (S)   

       pctapi pctaian pct2prace pcthisp

| | rank | count | prop100k | cum_prop100k |
|---|---|---|---|---|
| count | 151671.0 | 151671.0 | 151671.0 | 151671.0 |
| mean | 75649.497781 | 1596.357 | 0.591744 | 82520.575351 |
| std | 43614.414271 | 16338.75 | 6.056723 | 8902.405422 |
| min | 1.0 | 100.0 | 0.04 | 880.85 |
| 25% | 37881.0 | 143.0 | 0.05 | 80519.25 |
| 50% | 75695.0 | 237.0 | 0.09 | 85509.67 |
| 75% | 113519.0 | 551.0 | 0.2 | 88079.475 |
| max | 150436.0 | 2376206.0 | 880.85 | 89753.56 |
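
The summary has only four columns even though the file has several percentage columns. The likely reason is the `(S)` marker visible in the pctblack column of the printed output: a column that mixes numbers with text is read as text, and .describe() skips text columns by default. A minimal sketch, assuming `(S)` should be treated as a missing value, using two rows copied from the output above and the `na_values` parameter of `pd.read_csv` (the variable names are just for illustration):

```python
import io

import pandas as pd

# Two rows copied from the printed output, limited to the columns shown there.
sample = io.StringIO(
    "name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack\n"
    "SMITH,1,2376206,880.85,880.85,73.35,22.22\n"
    "YOUSKO,150436,100,0.04,89752.93,99,(S)\n"
)

# Assumption: "(S)" marks a missing value. Telling read_csv so lets pandas
# parse pctblack as a number, with NaN where the marker appeared.
surnames_sample = pd.read_csv(sample, na_values=["(S)"])

print(surnames_sample.dtypes)
print(surnames_sample.describe())
```

The same `na_values=["(S)"]` argument could be passed when reading the full file from the URL, though whether missing is the right interpretation of `(S)` depends on the data's documentation.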
