Hands-on Activity 8.1: Aggregating Data with Pandas
8.1.1 Intended Learning Outcomes
After this activity, the student should be able to:

Demonstrate querying and merging of dataframes

Perform advanced calculations on dataframes

Aggregate dataframes with pandas and numpy

Work with time series data

8.1.2 Resources
Computing Environment using Python 3.x
Attached Datasets (under Instructional Materials)
8.1.3 Procedures
The procedures can be found in the Canvas module. Check the following under Topics:

8.1 Weather Data Collection: GitHub link for 8.1
8.2 Querying and Merging: GitHub link for 8.2
8.3 Dataframe Operations: GitHub link for 8.3
8.4 Aggregations: GitHub link for 8.4
8.5 Time Series: GitHub link for 8.5
8.1.4 Data Analysis
Provide some comments here about the results of the procedures.

Storing Data in SQLite: I learned how to store a Pandas DataFrame in an SQLite database. It felt really empowering because now I know how to save data in a way that's easy to retrieve and manage later.
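
As a minimal sketch of the round trip described above, the snippet below writes a small DataFrame to SQLite and reads it back. The table name weather, the column names, and the values are made up for illustration, and an in-memory database is assumed so that no file is created.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
weather = pd.DataFrame({
    'station': ['A', 'A', 'B'],
    'date': ['2018-01-01', '2018-01-02', '2018-01-01'],
    'temp_c': [5.0, 7.5, 3.2],
})

# ':memory:' creates a temporary database that disappears when closed.
connection = sqlite3.connect(':memory:')

# Write the DataFrame to a table, replacing the table if it already exists.
weather.to_sql('weather', connection, index=False, if_exists='replace')

# Read the table back into a new DataFrame.
restored = pd.read_sql('SELECT * FROM weather', connection)
connection.close()

print(restored)
```

Passing a file path instead of ':memory:' would keep the data on disk between sessions.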

Data Manipulation Language (DML): I practiced using DML to update and organize data within the database. It was a little tricky at first, but I got the hang of it. It's definitely going to help when I need to adjust or clean up data for analysis.
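
A small sketch of what this can look like, assuming DML here means SQL statements such as INSERT, UPDATE, and DELETE run through Python's built-in sqlite3 module. The table, its columns, and the -999.0 placeholder value are hypothetical.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE weather (station TEXT, temp_c REAL)')

# INSERT: add a few rows, one of them with a placeholder reading.
connection.executemany(
    'INSERT INTO weather VALUES (?, ?)',
    [('A', 5.0), ('B', -999.0), ('C', 3.2)],
)

# UPDATE: correct a station name in place.
connection.execute("UPDATE weather SET station = 'A1' WHERE station = 'A'")

# DELETE: drop the row that holds the placeholder reading.
connection.execute('DELETE FROM weather WHERE temp_c = -999.0')

rows = connection.execute(
    'SELECT station, temp_c FROM weather ORDER BY station'
).fetchall()
connection.close()

print(rows)
```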

Aggregating Data: I learned how to use Pandas' aggregate functions to calculate things like the mean, minimum, and maximum values all at once.
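
For instance, a single agg() call can return several statistics for every column at once. This is a sketch on made-up readings; the column names temp_c and rain_mm are assumptions.

```python
import pandas as pd

# Hypothetical readings; values are illustrative only.
readings = pd.DataFrame({
    'temp_c': [5.0, 7.5, 3.2, 9.1],
    'rain_mm': [0.0, 2.5, 1.0, 0.5],
})

# One row per statistic, one column per original column.
summary = readings.agg(['mean', 'min', 'max'])
print(summary)
```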

Grouping Data: I also got to work with the groupby function, which groups data by a category and lets me apply different calculations to each group. For example, if I have weather data from different stations, I can group the data by station and easily get the average temperature or total rainfall.
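
The station example above could be sketched as follows. The data and the column names station, temp_c, and rain_mm are made up for illustration.

```python
import pandas as pd

# Hypothetical weather readings from two stations.
weather = pd.DataFrame({
    'station': ['A', 'A', 'B', 'B'],
    'temp_c': [5.0, 7.0, 3.0, 4.0],
    'rain_mm': [0.0, 2.5, 1.0, 0.5],
})

# Named aggregation: a different calculation for each output column.
station_summary = weather.groupby('station').agg(
    avg_temp_c=('temp_c', 'mean'),
    total_rain_mm=('rain_mm', 'sum'),
)
print(station_summary)
```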

Handling Time-Based Data: The real challenge came when I had to work with data that included dates and times (not just dates). I used functions like between_time and at_time to pull out data for specific hours or time ranges. This was a big step up because it let me work with hourly weather data instead of just daily data.
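
A brief sketch of both methods on made-up hourly data. The temp_c column simply counts the hours so that the selected rows are easy to recognize.

```python
import pandas as pd

# Two days of hypothetical hourly readings; temp_c is just the hour counter.
hourly = pd.DataFrame(
    {'temp_c': range(48)},
    index=pd.date_range('2018-01-01', periods=48, freq=pd.Timedelta(hours=1)),
)

# Rows between 08:00 and 10:00 on every day (both ends included by default).
morning = hourly.between_time('08:00', '10:00')

# Rows at exactly 12:00 on every day.
noon = hourly.at_time('12:00')

print(morning)
print(noon)
```

Both methods require a DatetimeIndex, which is why the dates are placed in the index rather than in a regular column.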

Storing, Manipulating, and Analyzing Data: Overall, I feel so much more confident in my ability to store, manipulate, and analyze data now. Learning how to work with time-based data, in particular, has been a game-changer.

Challenges with Newer Pandas Versions: It wasn't all smooth sailing, though. Some of the code I had used before didn't run cleanly with the latest version of Pandas. I got warnings that methods like last() and first() are deprecated, which meant I had to figure out how to replace them with newer approaches. It was a bit frustrating, but I eventually found alternatives, and it was a good reminder to stay updated with library changes!
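
One possible replacement is to filter with a boolean mask on the index, as sketched below. This is not necessarily the alternative used in this activity; the data is made up, and the masks are meant to select roughly what first('7D') and last('7D') would have returned.

```python
import pandas as pd

# Thirty days of hypothetical daily data.
daily = pd.DataFrame(
    {'value': range(30)},
    index=pd.date_range('2018-01-01', periods=30, freq='D'),
)

one_week = pd.Timedelta(days=7)

# Stand-in for daily.first('7D'): rows within 7 days of the first timestamp.
first_week = daily.loc[daily.index < daily.index[0] + one_week]

# Stand-in for daily.last('7D'): rows within 7 days of the last timestamp.
last_week = daily.loc[daily.index > daily.index[-1] - one_week]

print(first_week)
print(last_week)
```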

8.1.5 Supplementary Activity
Using the CSV files provided and what we have learned so far in this module, complete the following exercises:

1. With the earthquakes.csv file, select all the earthquakes in Japan with a magType of mb and a magnitude of 4.9 or greater.
2. For earthquakes with a magType of ml, create bins for each whole-number range of magnitude (for example, the first bin is 0-1, the second is 1-2, and so on) and count how many fall in each bin.
3. Using the faang.csv file, group by the ticker and resample to monthly frequency. Make the following aggregations:
   Mean of the opening price
   Maximum of the high price
   Minimum of the low price
   Mean of the closing price
   Sum of the volume traded
4. Build a crosstab with the earthquake data between the tsunami column and the magType column. Rather than showing the frequency count, show the maximum magnitude that was observed for each combination. Put the magType along the columns.
5. Calculate the rolling 60-day aggregations of OHLC data by ticker for the FAANG data. Use the same aggregations as exercise no. 3.
6. Create a pivot table of the FAANG data that compares the stocks. Put the ticker in the rows and show the averages of the OHLC and volume traded data.
7. Calculate the Z-scores for each numeric column of Netflix's data (ticker is NFLX) using apply().
8. Add event descriptions:
   Create a dataframe with the following three columns: ticker, date, and event. The columns should have the following values:
   ticker: 'FB'
   date: ['2018-07-25', '2018-03-19', '2018-03-20']
   event: ['Disappointing user growth announced after close.', 'Cambridge Analytica story', 'FTC investigation']
   Set the index to ['date', 'ticker'].
   Merge this data with the FAANG data using an outer join.
9. Use the transform() method on the FAANG data to represent all the values in terms of the first date in the data. To do so, divide all the values for each ticker by the values for the first date in the data for that ticker.

In [2]:
import pandas as pd

earthquake = pd.read_csv('earthquakes.csv')
faang = pd.read_csv('faang.csv')

In [3]:
earthquake.head(5)

,mag,magType,time,place,tsunami,parsed_place
0,1.35,ml,1539475168010,"9km NE of Aguanga, CA",0,California
1,1.29,ml,1539475129610,"9km NE of Aguanga, CA",0,California
2,3.42,ml,1539475062610,"8km NE of Aguanga, CA",0,California
3,0.44,ml,1539474978070,"9km NE of Aguanga, CA",0,California
4,2.16,md,1539474716050,"10km NW of Avenal, CA",0,California


In [5]:
earthquake.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9332 entries, 0 to 9331
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mag           9331 non-null   float64
 1   magType       9331 non-null   object 
 2   time          9332 non-null   int64  
 3   place         9332 non-null   object 
 4   tsunami       9332 non-null   int64  
 5   parsed_place  9332 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 437.6+ KB


In [6]:
# Rename the columns to more descriptive names.
earthquake.rename(columns={
    'mag': 'Magnitude',
    'parsed_place': 'Location',
    'magType': 'MagnitudeType',
    'time': 'Time',
    'tsunami': 'Tsunami',
    'place': 'PlaceDescription'
}, inplace=True)

In [8]:
# Earthquakes in Japan measured with the 'mb' magnitude type and a magnitude of at least 4.9.
FiltEarthquake = earthquake.query("Location == 'Japan' and MagnitudeType == 'mb' and Magnitude >= 4.9")
FiltEarthquake

,Magnitude,MagnitudeType,Time,PlaceDescription,Tsunami,Location
1563,4.9,mb,1538977532250,"293km ESE of Iwo Jima, Japan",0,Japan
2576,5.4,mb,1538697528010,"37km E of Tomakomai, Japan",0,Japan
3072,4.9,mb,1538579732490,"15km ENE of Hasaki, Japan",0,Japan
3632,4.9,mb,1538450871260,"53km ESE of Hitachi, Japan",0,Japan


In [22]:
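# Keep only the 'ml' magnitude type, then bin by whole-number magnitude.
ml = earthquake[earthquake['MagnitudeType'] == 'ml']
# Bin edges 0, 1, 2, ...; the "+ 4" pads the range past the maximum, which leaves some empty bins at the top.
bins = list(range(0, int(ml['Magnitude'].max()) + 4))
# right=False makes each bin closed on the left, e.g. [0, 1).
counts = pd.cut(ml['Magnitude'], bins=bins, right=False).value_counts().sort_index()

Magml = pd.DataFrame({'Magnitude (ml)': bins[:-1], 'occurrences': counts})
Magml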

Magnitude,Magnitude (ml),occurrences
"[0, 1)",0,2072
"[1, 2)",1,3126
"[2, 3)",2,985
"[3, 4)",3,153
"[4, 5)",4,6
"[5, 6)",5,2
"[6, 7)",6,0
"[7, 8)",7,0


In [23]:
faang.head()

,ticker,date,open,high,low,close,volume
0,FB,2018-01-02,177.68,181.58,177.55,181.42,18151903
1,FB,2018-01-03,181.88,184.78,181.33,184.67,16886563
2,FB,2018-01-04,184.9,186.21,184.0996,184.33,13880896
3,FB,2018-01-05,185.59,186.9,184.93,186.85,13574535
4,FB,2018-01-08,187.2,188.9,186.33,188.28,17994726


In [24]:
faang.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1255 entries, 0 to 1254
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ticker  1255 non-null   object 
 1   date    1255 non-null   object 
 2   open    1255 non-null   float64
 3   high    1255 non-null   float64
 4   low     1255 non-null   float64
 5   close   1255 non-null   float64
 6   volume  1255 non-null   int64  
dtypes: float64(4), int64(1), object(2)
memory usage: 68.8+ KB


In [26]:
faang['date'] = pd.to_datetime(faang['date']) #change dtype to datetime
faang.set_index('date',inplace = True) #index

In [30]:
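# Group by ticker, resample each group to month-end frequency, and aggregate.
# 'ME' is the month-end alias in pandas 2.2+; older versions use 'M'.
aggregate = faang.groupby('ticker').resample('ME').agg({
    'open': 'mean',
    'high': 'max',
    'low': 'min',
    'close': 'mean',
    'volume': 'sum'
})
aggregate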

ticker,date,open,high,low,close,volume
AAPL,2018-01-31,170.71469,176.6782,161.5708,170.699271,659679440
AAPL,2018-02-28,164.562753,177.9059,147.9865,164.921884,927894473
AAPL,2018-03-31,172.421381,180.7477,162.466,171.878919,713727447
AAPL,2018-04-30,167.332895,176.2526,158.2207,167.286924,666360147
AAPL,2018-05-31,182.635582,187.9311,162.7911,183.207418,620976206
AAPL,2018-06-30,186.605843,192.0247,178.7056,186.508652,527624365
AAPL,2018-07-31,188.065786,193.765,181.3655,188.179724,393843881
AAPL,2018-08-31,210.460287,227.1001,195.0999,211.477743,700318837
AAPL,2018-09-30,220.611742,227.8939,213.6351,220.356353,678972040
AAPL,2018-10-31,219.489426,231.6645,204.4963,219.137822,789748068


In [32]:
earthquake['Tsunami'].value_counts()

Tsunami,count
0,9271
1,61


In [36]:
crosstab_max_mag = pd.crosstab(earthquake['Tsunami'], earthquake['MagnitudeType'],values=earthquake['Magnitude'], aggfunc='max')
crosstab_max_mag

Tsunami (rows) / MagnitudeType (columns),mb,mb_lg,md,mh,ml,ms_20,mw,mwb,mwr,mww
0,5.6,3.5,4.11,1.1,4.2,,3.83,5.8,4.8,6.0
1,6.1,,,,5.1,5.7,4.41,,,7.5


In [37]:
agg = faang.groupby('ticker').rolling('60D').agg({
    'open': 'mean',
    'high': 'max',
    'low': 'min',
    'close': 'mean',
    'volume': 'sum'
})
agg

ticker,date,open,high,low,close,volume
AAPL,2018-01-02,166.927100,169.0264,166.0442,168.987200,25555934.0
AAPL,2018-01-03,168.089600,171.2337,166.0442,168.972500,55073833.0
AAPL,2018-01-04,168.480367,171.2337,166.0442,169.229200,77508430.0
AAPL,2018-01-05,168.896475,172.0381,166.0442,169.840675,101168448.0
AAPL,2018-01-08,169.324680,172.2736,166.0442,170.080040,121736214.0
...,...,...,...,...,...,...
NFLX,2018-12-24,283.509250,332.0499,233.6800,281.931750,525657894.0
NFLX,2018-12-26,281.844500,332.0499,231.2300,280.777750,520444588.0
NFLX,2018-12-27,281.070488,332.0499,231.2300,280.162805,532679805.0
NFLX,2018-12-28,279.916341,332.0499,231.2300,279.461341,521968250.0


In [38]:
# Pivot table with the ticker in the rows and the mean of each OHLC and volume column.
table = pd.pivot_table(faang, index='ticker', aggfunc='mean')
table

ticker,close,high,low,open,volume
AAPL,186.986218,188.906858,185.135729,187.038674,34021450.0
AMZN,1641.726175,1662.839801,1619.840398,1644.072669,5649563.0
FB,171.510936,173.615298,169.30311,171.454424,27687980.0
GOOG,1113.225139,1125.777649,1101.001594,1113.554104,1742645.0
NFLX,319.290299,325.224583,313.187273,319.620533,11470300.0


In [39]:
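# Compute the Z-score of each numeric column in Netflix's data.
netflix = faang[faang['ticker'] == 'NFLX']

def z_score(column):
    """Standardize a column: subtract its mean and divide by its standard deviation."""
    return (column - column.mean()) / column.std()

z_scores = netflix.select_dtypes(include='number').apply(z_score)
z_scores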

date,open,high,low,close,volume
2018-01-02,-2.500753,-2.516023,-2.410226,-2.416644,-0.088760
2018-01-03,-2.380291,-2.423180,-2.285793,-2.335286,-0.507606
2018-01-04,-2.296272,-2.406077,-2.234616,-2.323429,-0.959287
2018-01-05,-2.275014,-2.345607,-2.202087,-2.234303,-0.782331
2018-01-08,-2.218934,-2.295113,-2.143759,-2.192192,-1.038531
...,...,...,...,...,...
2018-12-24,-1.571478,-1.518366,-1.627197,-1.745946,-0.339003
2018-12-26,-1.735063,-1.439978,-1.677339,-1.341402,0.517040
2018-12-27,-1.407286,-1.417785,-1.495805,-1.302664,0.134868
2018-12-28,-1.248762,-1.289018,-1.297285,-1.292137,-0.085164


In [40]:
# Build the event descriptions for FB.
events_data = pd.DataFrame({
    'ticker': ['FB', 'FB', 'FB'],
    'date': ['2018-07-25', '2018-03-19', '2018-03-20'],
    'event': ['Disappointing user growth announced after close.',
              'Cambridge Analytica story',
              'FTC investigation']
})

# Convert the dates to datetime so they match the FAANG dates.
events_data['date'] = pd.to_datetime(events_data['date'])

# Outer-merge with the FAANG data; 'date' is faang's index level and a column in events_data.
data = pd.merge(faang, events_data, on=['date', 'ticker'], how='outer')
data

,date,ticker,open,high,low,close,volume,event
0,2018-01-02,AAPL,166.9271,169.0264,166.0442,168.9872,25555934,
1,2018-01-02,AMZN,1172.0000,1190.0000,1170.5100,1189.0100,2694494,
2,2018-01-02,FB,177.6800,181.5800,177.5500,181.4200,18151903,
3,2018-01-02,GOOG,1048.3400,1066.9400,1045.2300,1065.0000,1237564,
4,2018-01-02,NFLX,196.1000,201.6500,195.4200,201.0700,10966889,
...,...,...,...,...,...,...,...,...
1250,2018-12-31,AAPL,157.8529,158.6794,155.8117,157.0663,35003466,
1251,2018-12-31,AMZN,1510.8000,1520.7600,1487.0000,1501.9700,6954507,
1252,2018-12-31,FB,134.4500,134.6400,129.9500,131.0900,24625308,
1253,2018-12-31,GOOG,1050.9600,1052.7000,1023.5900,1035.6100,1493722,
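
The exercise also asks for the index to be set to ['date', 'ticker'] before merging, which the cell above skips by merging on the key names instead. A minimal sketch of the index-based variant follows. It uses a three-row stand-in for the FAANG data (FB closing prices copied from the tables in this activity), and the names prices, events, and merged are illustrative assumptions.

```python
import pandas as pd

# Tiny stand-in for the FAANG data, indexed by date and ticker.
prices = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-02', '2018-03-19', '2018-03-20']),
    'ticker': ['FB', 'FB', 'FB'],
    'close': [181.42, 172.56, 168.15],
}).set_index(['date', 'ticker'])

# Event descriptions with the same two-level index.
events = pd.DataFrame({
    'ticker': 'FB',
    'date': pd.to_datetime(['2018-07-25', '2018-03-19', '2018-03-20']),
    'event': ['Disappointing user growth announced after close.',
              'Cambridge Analytica story',
              'FTC investigation'],
}).set_index(['date', 'ticker'])

# Outer join on the shared index keeps rows that appear on either side.
merged = prices.merge(events, left_index=True, right_index=True, how='outer')
print(merged)
```

With the outer join, the 2018-01-02 row has no event and the 2018-07-25 row has no price in this stand-in, so both show up with missing values.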


In [41]:
datas = data[data['event'] == 'FTC investigation']
datas

,date,ticker,open,high,low,close,volume,event
267,2018-03-20,FB,167.47,170.2,161.95,168.15,129851768,FTC investigation


In [42]:
datas = data[data['event'] == 'Cambridge Analytica story']
datas

,date,ticker,open,high,low,close,volume,event
262,2018-03-19,FB,177.01,177.17,170.06,172.56,88140060,Cambridge Analytica story


In [43]:
datas = data[data['event'] == 'Disappointing user growth announced after close.']
datas

,date,ticker,open,high,low,close,volume,event
707,2018-07-25,FB,215.715,218.62,214.27,217.5,64592585,Disappointing user growth announced after close.


In [44]:
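# Express every value relative to each ticker's first row.
# This assumes the rows are in date order within each ticker, so iloc[0] is the first date.
datb = faang.groupby('ticker').transform(lambda x: x / x.iloc[0])
datb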

date,open,high,low,close,volume
2018-01-02,1.000000,1.000000,1.000000,1.000000,1.000000
2018-01-03,1.023638,1.017623,1.021290,1.017914,0.930292
2018-01-04,1.040635,1.025498,1.036889,1.016040,0.764707
2018-01-05,1.044518,1.029298,1.041566,1.029931,0.747830
2018-01-08,1.053579,1.040313,1.049451,1.037813,0.991341
...,...,...,...,...,...
2018-12-24,0.928993,0.940578,0.928131,0.916638,1.285047
2018-12-26,0.943406,0.974750,0.940463,0.976019,1.917695
2018-12-27,0.970248,0.978396,0.953857,0.980169,1.704782
2018-12-28,1.001221,0.989334,0.988395,0.973784,1.142383
