# Big Mac Mega CSV
For this part we're going to be working with the *big_mac_source_data_v2.csv*, which is the original dataset. 

We'll be removing columns and removing rows to make the data more manageable and hone in on what we're looking for.

Please note that this compares the data by working with the US dollar at the time of publishing the dataset, not AUD. It's got the local price and dollar exchange.

For these activities please look at *08 - Removing and Splitting Dataframes* and onwards.

In [10]:
import pandas as pd

#Read in the big_mac_source_data_v2.csv into a dataframe and show the first five rows of data (using .head())
big_mac_df = pd.read_csv('data/big_mac_source_data_v2.csv')
big_mac_df.head()
#Now, remove the iso_a3, GDP_dollar and GDP_local columns
big_mac_df.drop(columns=['iso_a3', 'GDP_dollar', 'GDP_local'])

#Print the dataframe
big_mac_df



Unnamed: 0,name,iso_a3,currency_code,local_price,dollar_ex,GDP_dollar,GDP_local,date
0,Argentina,ARG,ARS,2.50,1.00000,8709.072,8.709072e+03,2000-04-01
1,Australia,AUS,AUD,2.59,1.68000,21762.397,3.372292e+04,2000-04-01
2,Brazil,BRA,BRL,2.95,1.79000,3501.438,6.351375e+03,2000-04-01
3,Canada,CAN,CAD,2.85,1.47000,22340.553,3.319147e+04,2000-04-01
4,Switzerland,CHE,CHF,5.90,1.70000,41768.021,6.274204e+04,2000-04-01
...,...,...,...,...,...,...,...,...
2083,Ukraine,UKR,UAH,105.00,36.93145,4348.570,1.406445e+05,2023-07-01
2084,Uruguay,URY,UYU,259.00,37.76500,20221.940,8.324901e+05,2023-07-01
2085,United States,USA,USD,5.58,1.00000,76348.494,7.634849e+04,2023-07-01
2086,Vietnam,VNM,VND,74000.00,23687.50000,4086.519,9.564814e+07,2023-07-01


### Remove Rows
Now we want to remove some rows. We want to just focus on data from the start of the 2023-2024 financial year (01-07-2023).

There are two options for this:

#### Remove Rows by Index
We can remove all rows by index by looking at the index and deleting all rows. 

[Remove Rows By Index](https://scales.arabpsychology.com/stats/how-to-drop-rows-by-index-in-pandas-with-examples/#:~:text=Dropping%20rows%20by%20index%20in,provided%20to%20illustrate%20the%20process)

Look through the above to know how to drop rows by index. However, we need to know how to drop a broad range of rows by index. You'll need this code to find the index `df.loc[x:y].index`, which will go in place of `index=[...]` when calling the drop function.

#### Remove duplicates in a column
We can also remove all duplicates of a column except for the last value, which is probably a little easier. 


[Remove Duplicates and Keep the Latest](https://scales.arabpsychology.com/stats/how-do-i-drop-duplicate-rows-in-pandas-and-keep-the-latest-one/)

You'll want to remove the `sort_values('time')` section, but aside from that it should work if you remember to look at the `name` column as opposed to their example of `item` column. The only issue here is we do end up keeping a couple of countries which stopped reporting data (UAE and Russia). This means you will need to manually remove the first two rows (you can do this by index, look above).

In [11]:
#Remove the data using either method below. If you want to try both, COMMENT OUT ONE OF THEM AND CLEAR ALL OUTPUTS.
big_mac_df = big_mac_df.sort_values('date').drop_duplicates(['name'], keep='last')
big_mac_df = big_mac_df.drop([1242, 1861, 2014])
big_mac_df

Unnamed: 0,name,iso_a3,currency_code,local_price,dollar_ex,GDP_dollar,GDP_local,date
2067,Oman,OMN,OMR,1.42,0.385050,24772.462000,9.525012e+03,2023-07-01
2066,New Zealand,NZL,NZD,8.10,1.606813,47208.355000,7.423399e+04,2023-07-01
2065,Norway,NOR,NOK,70.00,10.116500,106328.405000,1.022259e+06,2023-07-01
2064,Netherlands,NLD,EUR,5.33,0.906990,56489.068000,5.360119e+04,2023-07-01
2063,Nicaragua,NIC,NIO,159.00,36.550000,2386.754000,8.569825e+04,2023-07-01
...,...,...,...,...,...,...,...,...
2037,Euro area,EUZ,EUR,5.28,0.906990,40906.703497,3.881544e+04,2023-07-01
2036,Estonia,EST,EUR,4.45,0.906990,28631.096000,2.716740e+04,2023-07-01
2035,Spain,ESP,EUR,5.20,0.906990,29420.615000,2.791655e+04,2023-07-01
2051,Italy,ITA,EUR,5.80,0.906990,34113.201000,3.236924e+04,2023-07-01


## Create .csv File
Let's finish this off by storing just the 01-07-2023 values into a new .csv file. 

In [12]:
#Write updated DataFrame to .csv file 'big_mac_2023.csv'
big_mac_df.to_csv(
    'Data/big_mac_2023.csv',
    index=False)