Hands-On Data Analysis with Pandas, Stefanie Molin
# Chapter 3 Exercises
8/04/23  
Data: 

In [2]:
import pandas as pd

#### 1. We want to look at data for FAANG stocks, but were given each as a separate .csv file. Combine them into a single file and store the df of the FAANG data as `faang` for the rest of the exercises. 
- Read in the five csv files
- Add a column to each df called `ticker` populated with the stock's symbol for easy lookup.
- Append them into a single df
- Save as a .csv called `faang.csv`

In [3]:
faang = pd.DataFrame()
for file in ['aapl', 'amzn', 'fb', 'goog', 'nflx']:
    # read in the file
    df = pd.read_csv(f'exercises/{file}.csv')
    # add ticker column
    df.insert(0, 'ticker', file.upper()) # insert() lets us pick the column position, unlike df['ticker'] = 'AAPL'
    # append to faang
    faang = pd.concat([faang,df], axis=0)

In [4]:
faang.head()

Unnamed: 0,ticker,date,high,low,open,close,volume
0,AAPL,2018-01-02,43.075001,42.314999,42.540001,43.064999,102223600.0
1,AAPL,2018-01-03,43.637501,42.990002,43.1325,43.057499,118071600.0
2,AAPL,2018-01-04,43.3675,43.02,43.134998,43.2575,89738400.0
3,AAPL,2018-01-05,43.842499,43.262501,43.360001,43.75,94640000.0
4,AAPL,2018-01-08,43.9025,43.482498,43.587502,43.587502,82271200.0


In [5]:
faang.ticker.value_counts()

AAPL    251
AMZN    251
FB      251
GOOG    251
NFLX    251
Name: ticker, dtype: int64

In [6]:
faang.to_csv('data/faang.csv', index=False)

In [7]:
faang.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1255 entries, 0 to 250
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ticker  1255 non-null   object 
 1   date    1255 non-null   object 
 2   high    1255 non-null   float64
 3   low     1255 non-null   float64
 4   open    1255 non-null   float64
 5   close   1255 non-null   float64
 6   volume  1255 non-null   float64
dtypes: float64(5), object(2)
memory usage: 78.4+ KB


Combining the csv files into one dataframe looks good. It needs type conversion now (the `date` column is an object type and `volume` is a float).
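
The `info()` output above reports "1255 entries, 0 to 250", so each file's row labels were kept and the index repeats across tickers. A minimal sketch of an alternative, assuming small in-memory stand-ins for the csv files (the `AAA`/`BBB` symbols and values are made up): collect the frames in a list and call `pd.concat` once with `ignore_index=True`.

```python
import pandas as pd

# Toy stand-ins for the per-ticker files (hypothetical values, not the real data)
frames = []
for symbol, closes in {'AAA': [1.0, 2.0], 'BBB': [3.0, 4.0]}.items():
    df = pd.DataFrame({'date': ['2018-01-02', '2018-01-03'], 'close': closes})
    df.insert(0, 'ticker', symbol)  # same ticker-column step as in the loop above
    frames.append(df)

# One concat call at the end; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat(frames, ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```

Concatenating once avoids copying the growing frame on every iteration. Whether to renumber the index is a judgment call; the outputs later in this notebook rely on the repeated labels being kept.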

#### 2. With faang, use type conversion to 
- cast the values of the `date` column into datetimes
- cast the values of the volume column into integers  

Then sort by date and ticker.

In [8]:
# use the assign method to do all your work at once
faang = faang.assign(
    date=pd.to_datetime(faang.date),
    volume=faang['volume'].astype(int)
).sort_values(['date','ticker'])

In [9]:
faang.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1255 entries, 0 to 250
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   ticker  1255 non-null   object        
 1   date    1255 non-null   datetime64[ns]
 2   high    1255 non-null   float64       
 3   low     1255 non-null   float64       
 4   open    1255 non-null   float64       
 5   close   1255 non-null   float64       
 6   volume  1255 non-null   int32         
dtypes: datetime64[ns](1), float64(4), int32(1), object(1)
memory usage: 73.5+ KB
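
The `int32` dtype above suggests that `astype(int)` resolved to a 32-bit platform default on this machine (this can happen on Windows with older NumPy versions). A small sketch, using a made-up two-row frame, of requesting the width explicitly so the result does not depend on the platform:

```python
import pandas as pd

# Toy frame mirroring the dtypes before conversion (illustrative values only)
toy = pd.DataFrame({
    'date': ['2018-01-03', '2018-01-02'],
    'volume': [118071600.0, 102223600.0],
})

converted = toy.assign(
    date=pd.to_datetime(toy['date']),
    volume=toy['volume'].astype('int64'),  # explicit width, same on every platform
).sort_values('date')

print(converted.dtypes)
```

The largest volume in `faang` (about 3.8e8) still fits in 32 bits, so the original conversion is not wrong here; the explicit dtype only matters for larger values or for reproducibility.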


#### 3. Find the seven rows in `faang` with the lowest value for volume

In [10]:
faang[:5]

Unnamed: 0,ticker,date,high,low,open,close,volume
0,AAPL,2018-01-02,43.075001,42.314999,42.540001,43.064999,102223600
0,AMZN,2018-01-02,1190.0,1170.51001,1172.0,1189.01001,2694500
0,FB,2018-01-02,181.580002,177.550003,177.679993,181.419998,18151900
0,GOOG,2018-01-02,1066.939941,1045.22998,1048.339966,1065.0,1237600
0,NFLX,2018-01-02,201.649994,195.419998,196.100006,201.070007,10966900


In [11]:
faang.volume.describe()

count    1.255000e+03
mean     3.651989e+07
std      5.763399e+07
min      6.790000e+05
25%      3.960050e+06
50%      1.098100e+07
75%      3.069630e+07
max      3.849868e+08
Name: volume, dtype: float64

In [12]:
faang.sort_values(['volume'])[:7]

Unnamed: 0,ticker,date,high,low,open,close,volume
126,GOOG,2018-07-03,1135.819946,1100.02002,1135.819946,1102.890015,679000
226,GOOG,2018-11-23,1037.589966,1022.398987,1030.0,1023.880005,691500
99,GOOG,2018-05-24,1080.469971,1066.150024,1079.0,1079.23999,766800
130,GOOG,2018-07-10,1159.589966,1149.589966,1156.97998,1152.839966,798400
152,GOOG,2018-08-09,1255.541992,1246.01001,1249.900024,1249.099976,848600
159,GOOG,2018-08-20,1211.0,1194.625977,1205.02002,1207.77002,870800
161,GOOG,2018-08-22,1211.839966,1199.0,1200.0,1207.329956,887400


In [13]:
# alternatively
faang.nsmallest(7,'volume')

Unnamed: 0,ticker,date,high,low,open,close,volume
126,GOOG,2018-07-03,1135.819946,1100.02002,1135.819946,1102.890015,679000
226,GOOG,2018-11-23,1037.589966,1022.398987,1030.0,1023.880005,691500
99,GOOG,2018-05-24,1080.469971,1066.150024,1079.0,1079.23999,766800
130,GOOG,2018-07-10,1159.589966,1149.589966,1156.97998,1152.839966,798400
152,GOOG,2018-08-09,1255.541992,1246.01001,1249.900024,1249.099976,848600
159,GOOG,2018-08-20,1211.0,1194.625977,1205.02002,1207.77002,870800
161,GOOG,2018-08-22,1211.839966,1199.0,1200.0,1207.329956,887400
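
Both approaches returned the same seven rows. A toy check of that equivalence, assuming a made-up frame with no tied volumes (with ties, the row order could differ between the two methods):

```python
import pandas as pd

# Illustrative data only; volumes are distinct so ordering is unambiguous
toy = pd.DataFrame({
    'ticker': ['A', 'B', 'C', 'D', 'E'],
    'volume': [500, 100, 400, 200, 300],
})

by_sort = toy.sort_values('volume').head(3)
by_nsmallest = toy.nsmallest(3, 'volume')

print(by_sort.equals(by_nsmallest))  # True
```

`nsmallest` states the intent more directly and does not need to sort the whole frame.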


#### 4. Currently the data is somewhere between long and wide format. Make it completely long formatted.

In [14]:
faang.columns[2:]

Index(['high', 'low', 'open', 'close', 'volume'], dtype='object')

In [15]:
melted_faang = faang.melt(
    id_vars=['ticker', 'date'], value_vars=faang.columns[2:],
    value_name='measurement', var_name='metric'
)

In [16]:
melted_faang[:5]

Unnamed: 0,ticker,date,metric,measurement
0,AAPL,2018-01-02,high,43.075001
1,AMZN,2018-01-02,high,1190.0
2,FB,2018-01-02,high,181.580002
3,GOOG,2018-01-02,high,1066.939941
4,NFLX,2018-01-02,high,201.649994
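
A quick sanity check for a melt is the row count: the long frame should have (number of original rows) × (number of melted columns) rows. A sketch on a made-up two-row frame, leaving out `value_vars` so that every non-id column is melted:

```python
import pandas as pd

# Illustrative wide-ish frame with two metrics per (ticker, date) pair
wide = pd.DataFrame({
    'ticker': ['A', 'B'],
    'date': pd.to_datetime(['2018-01-02', '2018-01-02']),
    'open': [10.0, 20.0],
    'close': [11.0, 19.0],
})

long_format = wide.melt(
    id_vars=['ticker', 'date'],
    var_name='metric',
    value_name='measurement',
)

print(len(long_format))  # 4 = 2 rows x 2 metrics
```

Applied to `faang`, the same reasoning would predict 1255 × 5 rows in `melted_faang`.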


#### 5. Suppose we found that on 7/26/2018 there was a glitch in how the data was recorded. How should we handle this?
- It's a single day, so just get the data for that day and insert it over the incorrect or missing values.
- If research reveals no significant movement for the stocks on that day, another solution would be to remove the incorrect values for that day and impute each one with the average of the previous and next trading days' values.
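
The second option can be sketched as follows, assuming a made-up single-ticker series where the 7/26 value is the bad one. For a single missing day, linear interpolation gives exactly the average of the two neighboring values.

```python
import numpy as np
import pandas as pd

# Illustrative opening prices; 999.0 stands in for the glitched reading
toy = pd.DataFrame(
    {'open': [10.0, 12.0, 999.0, 16.0]},
    index=pd.to_datetime(['2018-07-24', '2018-07-25', '2018-07-26', '2018-07-27']),
)

glitch_day = pd.Timestamp('2018-07-26')
toy.loc[glitch_day, 'open'] = np.nan  # remove the incorrect value

# Fill the single gap with the midpoint of the previous and next rows
toy['open'] = toy['open'].interpolate(method='linear')
print(toy.loc[glitch_day, 'open'])  # 14.0
```

With several tickers in one frame, as in `faang`, the interpolation would need to run per ticker (for example inside a `groupby('ticker')`) so that values from different stocks are not mixed.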

 ------

#### 6. The European Centre for Disease Prevention and Control (ECDC) provides an open dataset on COVID-19 cases ("daily number of new reported cases of COVID-19 by country worldwide"). The dataset is updated daily, but we have a snapshot of data from 1/1/2020 to 9/18/2020. Clean and pivot the data to wide format.
- Read in the covid19_cases.csv file
- Create a date column using the data in the dateRep column
- Set the date col as the index and sort the index
- Replace all occurrences of `United_States_of_America` and `United_Kingdom` with `USA` and `UK` (`.replace()`)
- Using `countriesAndTerritories`, filter the cleaned COVID-19 cases data down to Argentina, Brazil, China, Colombia, India, Italy, Mexico, Peru, Russia, Spain, Turkey, UK, USA
- Pivot the data so the index contains the dates, the cols contain the country names, and the values are the case counts (cases col). Fill NaNs with 0.

In [51]:
# Read in the data
covid = pd.read_csv('exercises/covid19_cases.csv')
covid[:5]

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
0,01/01/2020,1,1,2020,0,0,Lithuania,LT,LTU,2794184.0,Europe,
1,01/01/2020,1,1,2020,0,0,Iceland,IS,ISL,356991.0,Europe,
2,01/01/2020,1,1,2020,0,0,Nepal,NP,NPL,28608715.0,Asia,
3,01/01/2020,1,1,2020,0,0,San_Marino,SM,SMR,34453.0,Europe,
4,01/01/2020,1,1,2020,0,0,Canada,CA,CAN,37411038.0,America,


In [52]:
# Create a date format column using the data in dateRep
covid['date'] = pd.to_datetime(covid['dateRep'], format='%d/%m/%Y')

In [53]:
# Set this column as an index
covid.set_index('date', inplace=True)

In [54]:
# Replace all occurrences of United_States_of_America and United_Kingdom with USA and UK
covid = covid.replace(to_replace=['United_States_of_America', 'United_Kingdom'],
              value=['USA','UK']).sort_index()

In [55]:
keep = ['Argentina', 'Brazil', 'China', 'Colombia', 'India', 'Italy', 
        'Mexico', 'Peru', 'Russia', 'Spain', 'Turkey', 'UK', 'USA']
mask = covid.countriesAndTerritories.isin(keep)
covid = covid[mask]


In [25]:
# Pivot the data (convert to wide format) so
#     index = dates
#     columns = country names
#     values = case counts from `cases` column
# Fill NaNs with 0

In [56]:
covid = covid.pivot(
    columns='countriesAndTerritories', values='cases'
).fillna(0)

In [57]:
covid

countriesAndTerritories,Argentina,Brazil,China,Colombia,India,Italy,Mexico,Peru,Russia,Spain,Turkey,UK,USA
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-03,0.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-05,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-09-14,10778.0,14768.0,29.0,7355.0,92071.0,1456.0,4408.0,6787.0,5449.0,27404.0,1527.0,3330.0,33871.0
2020-09-15,9056.0,15155.0,22.0,5573.0,83809.0,1008.0,3335.0,4241.0,5509.0,9437.0,1716.0,2621.0,34841.0
2020-09-16,9908.0,36653.0,24.0,6698.0,90123.0,1229.0,4771.0,4160.0,5529.0,11193.0,1742.0,3103.0,51473.0
2020-09-17,11893.0,36820.0,7.0,7787.0,97894.0,1452.0,4444.0,6380.0,5670.0,11291.0,1771.0,3991.0,24598.0


This matched the solution. Now here's how you'd do it all at once like a maniac:

In [58]:
covid = pd.read_csv('exercises/covid19_cases.csv').assign(
    date=lambda x: pd.to_datetime(x.dateRep, format='%d/%m/%Y')
).set_index('date').replace(to_replace=['United_States_of_America', 'United_Kingdom'],
              value=['USA','UK']).sort_index()

covid[
    covid.countriesAndTerritories.isin(['Argentina', 'Brazil', 'China', 'Colombia', 'India', 'Italy', 
        'Mexico', 'Peru', 'Russia', 'Spain', 'Turkey', 'UK', 'USA'
    ])
].reset_index().pivot(index='date', columns='countriesAndTerritories', values='cases').fillna(0)

countriesAndTerritories,Argentina,Brazil,China,Colombia,India,Italy,Mexico,Peru,Russia,Spain,Turkey,UK,USA
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-03,0.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-05,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-09-14,10778.0,14768.0,29.0,7355.0,92071.0,1456.0,4408.0,6787.0,5449.0,27404.0,1527.0,3330.0,33871.0
2020-09-15,9056.0,15155.0,22.0,5573.0,83809.0,1008.0,3335.0,4241.0,5509.0,9437.0,1716.0,2621.0,34841.0
2020-09-16,9908.0,36653.0,24.0,6698.0,90123.0,1229.0,4771.0,4160.0,5529.0,11193.0,1742.0,3103.0,51473.0
2020-09-17,11893.0,36820.0,7.0,7787.0,97894.0,1452.0,4444.0,6380.0,5670.0,11291.0,1771.0,3991.0,24598.0
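
The pivoted counts print as floats (`0.0`, `17.0`) because the missing date/country combinations become `NaN` before `fillna(0)` runs. A sketch on a made-up three-row frame that restricts the replacement to the country column and casts back to integers at the end; both choices are optional variations, not part of the exercise solution:

```python
import pandas as pd

# Illustrative rows only; the UK has no entry for 2020-01-02
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02']),
    'countriesAndTerritories': ['United_Kingdom', 'Italy', 'Italy'],
    'cases': [1, 2, 3],
})

wide = (
    toy.assign(
        countriesAndTerritories=toy['countriesAndTerritories'].replace(
            {'United_Kingdom': 'UK'}
        )
    )
    .pivot(index='date', columns='countriesAndTerritories', values='cases')
    .fillna(0)          # missing date/country pairs become 0
    .astype('int64')    # counts are whole numbers, so drop the float dtype
)
print(wide)
```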


#### 7. `covid19_total_cases.csv` contains aggregated data, showing the total number of cases per country. Use it to find the 20 countries with the largest COVID-19 case totals. 
When reading the CSV, pass in `index_col='index'`, and note that it will be helpful to transpose the data before isolating the countries. 

In [77]:
df = pd.read_csv('exercises/covid19_total_cases.csv', index_col='index')
df.T.nlargest(20,'cases').sort_values('cases', ascending=False)


index,cases
USA,6724667
India,5308014
Brazil,4495183
Russia,1091186
Peru,756412
Colombia,750471
Mexico,688954
South_Africa,657627
Spain,640040
Argentina,601700


In [None]:
df.set_index('index