# Data Preprocessing 

In this notebook we clean and prepare the data for analysis.

In [80]:
import os
import numpy as np
import pandas as pd
from itertools import chain

## Cleaning Automotive Data

### Loading Data and selecting correct columns and rows

After inspecting the files, we can see that the data starts at row 4 in every file. This means we can load them all in the same way, skipping the first three rows.

In [15]:
# Getting data paths
raw_path = os.path.join(os.pardir, 'data', 'raw', 'automotive')
path_18 = os.path.join(raw_path, 'oica_stats_2018.xlsx')
path_19 = os.path.join(raw_path, 'oica_stats_2019.xlsx')
path_20 = os.path.join(raw_path, 'oica_stats_2020.xlsx')

# Loading data, skipping the first 3 rows since they contain no data
pdf_18 = pd.read_excel(path_18, skiprows=3)
pdf_19 = pd.read_excel(path_19, skiprows=3)
pdf_20 = pd.read_excel(path_20, skiprows=3)
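Since the three files follow the same naming pattern, the paths could also be built in one step. The following is only a sketch: the `paths` dictionary and the `years` tuple are illustrative names that are not used elsewhere in this notebook.

```python
import os

# Same folder layout as in the cell above.
raw_path = os.path.join(os.pardir, 'data', 'raw', 'automotive')

# Hypothetical helper structure: one path per year, keyed by the year.
years = (2018, 2019, 2020)
paths = {year: os.path.join(raw_path, f'oica_stats_{year}.xlsx') for year in years}

for year, path in paths.items():
    print(year, path)

# Each path could then be passed to pd.read_excel(path, skiprows=3) as above.
```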

Inspecting the data frames 

In [26]:
pdf_18.head(10)

Unnamed: 0.1,Unnamed: 0,UNITS,YTD 2017,YTD 2018,Unnamed: 5.1,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,,CARS,Q1-Q4,Q1-Q4,VARIATION,,,,,,,,,
1,,EUROPE,19026293.893204,18696196,-0.01735,,,,,,,,,
2,,- EUROPEAN UNION 27 countries,16598458.893204,16060576,-0.032406,,,,,,,,,
3,,- EUROPEAN UNION 15 countries,12773507.893204,12056800,-0.056109,,,,,,,,,
4,,Double Counts Austria / Germany,,,,,,,,,,,,
5,,Double Counts Austria / Japan,,,,,,,,,,,,
6,,Double Counts Belgium / Germany,,,,,,,,,,,,
7,,Double Counts Italy / Germany,,,,,,,,,,,,
8,,Double Counts Portugal / World,,,,,,,,,,,,
9,,AUSTRIA,100398.058252,103410,0.27,,,,,,,,,


In [25]:
pdf_19.head(10)

Unnamed: 0.1,Unnamed: 0,UNITS,2018,2019,Unnamed: 5
0,,CARS,Q1-Q4,Q1-Q4,VARIATION
1,,EUROPE,19660923,18722527,-0.047729
2,,- EUROPEAN UNION 27 countries,16746049,15837082,-0.054279
3,,- EUROPEAN UNION 15 countries,12614691,11687147,-0.073529
4,,Double Counts Austria / Germany,,,
5,,Double Counts Austria / Japan,,,
6,,Double Counts Belgium / Germany,,,
7,,Double Counts Italy / Germany,,,
8,,Double Counts Portugal / World,,,
9,,AUSTRIA,144500,158400,0.096194


In [53]:
pdf_20.head(10)

Unnamed: 0,UNITS,YTD 2019,YTD 2020,VARIATION
0,CARS,Q1-Q4,Q1-Q4,
1,EUROPE,18724208,14545984.928,-0.223146
2,- EUROPEAN UNION 28 countries,15838743,12034836.928,-0.240165
3,- EUROPEAN UNION 15 countries,11680894,8631718,-0.26104
4,AUSTRIA,158400,104544.0,-0.34
5,BELGIUM,247020,237057,-0.040333
6,FINLAND,114785,86270,-0.248421
7,FRANCE,1665787,927718,-0.443075
8,GERMANY,4663749,3515372,-0.246235
9,ITALY,542472,451826,-0.167098


We are not interested in any of the unnamed columns, so we will drop them all (note that this is the same for all files).

In [29]:
automotive_data = [pdf_18, pdf_19, pdf_20]

for i, pdf in enumerate(automotive_data):
    cols_to_keep = [col for col in pdf.columns if not str(col).startswith('Unnamed')]
    automotive_data[i] = pdf[cols_to_keep]

automotive_data[0].head(5)

Unnamed: 0,UNITS,YTD 2017,YTD 2018,Unnamed: 4
0,CARS,Q1-Q4,Q1-Q4,VARIATION
1,EUROPE,19026293.893204,18696196,-0.01735
2,- EUROPEAN UNION 27 countries,16598458.893204,16060576,-0.032406
3,- EUROPEAN UNION 15 countries,12773507.893204,12056800,-0.056109
4,Double Counts Austria / Germany,,,
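The output above still shows an unnamed column whose first row holds `VARIATION`, and a filter based purely on the column name would remove such a column together with the empty ones. As a hedged alternative, the sketch below drops only the columns that contain no data at all. It runs on a toy frame with made-up values, not on the OICA files, and the names `toy` and `trimmed` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the 2018 layout (values are illustrative only).
toy = pd.DataFrame({
    'Unnamed: 0': [np.nan, np.nan, np.nan],
    'UNITS': ['CARS', 'EUROPE', 'AUSTRIA'],
    'YTD 2017': ['Q1-Q4', 100.0, 10.0],
    'YTD 2018': ['Q1-Q4', 90.0, 11.0],
    'Unnamed: 4': ['VARIATION', -0.1, 0.1],
    'Unnamed: 5': [np.nan, np.nan, np.nan],
})

# Drop only the columns in which every value is NaN.
trimmed = toy.dropna(axis=1, how='all')
print(trimmed.columns.tolist())
```

With this approach the unnamed column that carries `VARIATION` in its first row survives, so its name can still be recovered from that row later.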


The first row should be part of the header, so we will fix that. We want to keep the names of the first three columns and take the name of the $4^{th}$ one from the $1^{st}$ row. Note that for 2020 we will just drop the first row.

In [54]:
# we don't want to do this for the 2020 data 
for i, pdf in enumerate(automotive_data[:2]):
    pdf.columns = pdf.columns[:3].tolist() + pdf.iloc[0, 3:].to_list()
    automotive_data[i] = pdf.iloc[1:]

# for 2020 we just drop the first row 
automotive_data[2] = automotive_data[2].iloc[1:]
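The header fix can also be wrapped in a small function that works on a copy, which avoids modifying the input frame in place. This is a minimal sketch on a toy frame with made-up values. The function name `promote_header` and its `n_keep` parameter are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

def promote_header(pdf, n_keep=3):
    """Keep the first n_keep column names, take the rest from row 0, then drop row 0."""
    new_cols = pdf.columns[:n_keep].tolist() + pdf.iloc[0, n_keep:].tolist()
    out = pdf.iloc[1:].copy()  # copy so the caller's frame is left untouched
    out.columns = new_cols
    return out

# Toy frame with the same shape of problem as the 2018 and 2019 files.
toy = pd.DataFrame({
    'UNITS': ['CARS', 'EUROPE'],
    'YTD 2017': ['Q1-Q4', 100.0],
    'YTD 2018': ['Q1-Q4', 90.0],
    'Unnamed: 4': ['VARIATION', -0.1],
})

fixed = promote_header(toy)
print(fixed.columns.tolist())
```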

Now we will look at the bottom of the files 

In [58]:
automotive_data[0].tail(20)

Unnamed: 0,UNITS,YTD 2017,YTD 2018,VARIATION
84,ZIMBABWE,,,
85,OTHERS,,,
86,TOTAL,72663012.893204,70466344.0,-0.030231
87,,,,
88,"Note: Audi, BMW, JLR, Mercedes, Scania and Dai...",,,
89,Estimate,,,
90,,,,
91,,,,
92,,,,
93,,,,


In [60]:
automotive_data[1].tail(20)

Unnamed: 0,UNITS,2018,2019,VARIATION
84,ZIMBABWE,,,
85,OTHERS,,,
86,TOTAL,71750946.0,67149196.0,-0.064135
87,,,,
88,"Note: Audi, BMW, JLR, Mercedes, Scania and Dai...",,,
89,Estimate,,,
90,,,,
91,,,,
92,,,,
93,,,,


In [61]:
automotive_data[2].tail(20)

Unnamed: 0,UNITS,YTD 2019,YTD 2020,VARIATION
46,IRAN,770000.0,826210.0,0.073
47,JAPAN,8329130.0,6960025.0,-0.164376
48,MALAYSIA,534115.0,457755.0,-0.142965
49,"MYANMAR, yearly only",12617.0,8346.0,-0.338512
50,PAKISTAN,156623.0,95504.0,-0.39023
51,PHILIPPINES,57238.0,37141.0,-0.351113
52,SOUTH KOREA,3612587.0,3211706.0,-0.110968
53,TAIWAN,189549.0,180967.0,-0.045276
54,THAILAND,795254.0,537633.0,-0.323948
55,"VIETNAM, yearly only",129006.0,125235.0,-0.029231


We mainly see NaNs and some notes, along with some continent-based aggregates. We will remove these for now. The easiest way to do this is to drop any row that has a NaN in the $3^{rd}$ column, since that column holds the year of interest for each respective data frame.

In [68]:
for i, pdf in enumerate(automotive_data):
    automotive_data[i] = automotive_data[i].dropna(subset=[pdf.columns[2]])

In [69]:
automotive_data[0]

Unnamed: 0,UNITS,YTD 2017,YTD 2018,VARIATION
1,EUROPE,19026293.893204,18696196,-0.01735
2,- EUROPEAN UNION 27 countries,16598458.893204,16060576,-0.032406
3,- EUROPEAN UNION 15 countries,12773507.893204,12056800,-0.056109
9,AUSTRIA,100398.058252,103410,0.27
10,BELGIUM,332979,265958,-0.201277
11,FINLAND,108838.834951,112104,0.22
12,FRANCE,1754000,1763000,0.005131
13,GERMANY,5645584,5120409,-0.093024
14,ITALY,742642,670932,-0.096561
15,"NETHERLANDS *** AS OF 2013, FIGURES ONCE A YE...",publication stopped,publication stopped,
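Rows such as the Netherlands one above survive `dropna` because the year column holds the text `publication stopped` rather than a NaN. One possible way to catch these, sketched here on a toy frame with made-up values, is to coerce the year column to numbers first and keep only the rows that convert. The names `toy`, `numeric` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: one valid row, one text placeholder, one empty row.
toy = pd.DataFrame({
    'UNITS': ['AUSTRIA', 'NETHERLANDS', 'ZIMBABWE'],
    'YTD 2017': [100.0, 'publication stopped', np.nan],
    'YTD 2018': [110.0, 'publication stopped', np.nan],
})

year_col = toy.columns[2]  # third column, the year of interest

# Non-numeric entries become NaN, so text placeholders are treated like gaps.
numeric = pd.to_numeric(toy[year_col], errors='coerce')
cleaned = toy[numeric.notna()]
print(cleaned['UNITS'].tolist())
```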


This simplifies our dataset, but for this study we want to go further and split the aggregated data from the country data.

In [83]:
# We will start by looking at all the unique values in the 'UNITS' column
ls = [set(pdf['UNITS'].values) for pdf in automotive_data]
unique_units = sorted(set(chain.from_iterable(ls)))
print(unique_units)

[' - EUROPEAN UNION 15 countries', ' - EUROPEAN UNION 27 countries', ' - EUROPEAN UNION 28 countries', ' - EUROPEAN UNION New Members', ' - NAFTA', ' - OTHER EUROPE', ' - SOUTH AMERICA', ' EUROPE', 'AFRICA', 'ALGERIA', 'AMERICA', 'ARGENTINA', 'ASIA-OCEANIA', 'AUSTRALIA', 'AUSTRIA', 'AZERBAIJAN', 'BELARUS', 'BELGIUM', 'BRAZIL', 'CANADA', 'CANADA  ', 'CHINA', 'CIS', 'COLOMBIA', 'CZECH REPUBLIC', 'CZECH REPUBLIC ', 'Double Counts Asia / World', 'Double Counts CIS / World', 'Double Counts South Africa / World', 'Double counts South America / World', 'EGYPT', 'EGYPT, yearly only', 'FINLAND', 'FRANCE', 'GERMANY', 'HUNGARY', 'INDIA', 'INDONESIA', 'IRAN', 'ITALY', 'JAPAN', 'KAZAKHSTAN', 'MALAYSIA', 'MEXICO', 'MOROCCO', 'MYANMAR, yearly only', 'NETHERLANDS *** AS OF 2013,  FIGURES ONCE A YEAR ONLY', 'NETHERLANDS,  FIGURES ONCE A YEAR ONLY', 'PAKISTAN', 'PAKISTAN ', 'PHILIPPINES', 'POLAND', 'PORTUGAL', 'ROMANIA', 'RUSSIA', 'SERBIA', 'SLOVAKIA', 'SLOVENIA', 'SOUTH AFRICA', 'SOUTH KOREA', 'SPAIN',

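From this we can see that there are a few entries we can remove: those starting with ' -', since they are trade zones, the double count lines, and the total line. We will also remove CIS. Note that the filter below only handles the trade zones, the double counts and CIS; 'TOTAL ' is still present in its output and has to be dropped separately.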

In [100]:
units_to_keep = [
    utk for utk in unique_units
    if not utk.startswith(' -')
    and not utk.startswith('Double')
    and utk != 'CIS'
]
print(units_to_keep)

[' EUROPE', 'AFRICA', 'ALGERIA', 'AMERICA', 'ARGENTINA', 'ASIA-OCEANIA', 'AUSTRALIA', 'AUSTRIA', 'AZERBAIJAN', 'BELARUS', 'BELGIUM', 'BRAZIL', 'CANADA', 'CANADA  ', 'CHINA', 'COLOMBIA', 'CZECH REPUBLIC', 'CZECH REPUBLIC ', 'EGYPT', 'EGYPT, yearly only', 'FINLAND', 'FRANCE', 'GERMANY', 'HUNGARY', 'INDIA', 'INDONESIA', 'IRAN', 'ITALY', 'JAPAN', 'KAZAKHSTAN', 'MALAYSIA', 'MEXICO', 'MOROCCO', 'MYANMAR, yearly only', 'NETHERLANDS *** AS OF 2013,  FIGURES ONCE A YEAR ONLY', 'NETHERLANDS,  FIGURES ONCE A YEAR ONLY', 'PAKISTAN', 'PAKISTAN ', 'PHILIPPINES', 'POLAND', 'PORTUGAL', 'ROMANIA', 'RUSSIA', 'SERBIA', 'SLOVAKIA', 'SLOVENIA', 'SOUTH AFRICA', 'SOUTH KOREA', 'SPAIN', 'SWEDEN', 'SWEDEN, FIGURES ONCE A YEAR ONLY', 'TAIWAN', 'TAIWAN ', 'THAILAND', 'TOTAL ', 'TURKEY', 'UKRAINE', 'UNITED KINGDOM', 'UNITED KINGDOM  ', 'USA', 'UZBEKISTAN', 'UZBEKISTAN ', 'VIETNAM', 'VIETNAM (PC+CV in 2019)', 'VIETNAM, yearly only']
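To separate the aggregated rows from the country rows, one option is to match the stripped labels against a fixed set of aggregate names. The sketch below assumes that the aggregates are the five labels listed in `AGGREGATES`, which were picked by eye from the output above. That set, the toy frame and its values are illustrative assumptions.

```python
import pandas as pd

# Assumed aggregate labels, read off the list of units printed above.
AGGREGATES = {'EUROPE', 'AFRICA', 'AMERICA', 'ASIA-OCEANIA', 'TOTAL'}

# Toy frame with made-up values and the same stray whitespace as the raw labels.
toy = pd.DataFrame({
    'UNITS': [' EUROPE', 'AUSTRIA', 'TOTAL ', 'JAPAN'],
    'YTD 2018': [100.0, 10.0, 150.0, 40.0],
})

# Strip whitespace before matching so ' EUROPE' and 'TOTAL ' are recognised.
is_aggregate = toy['UNITS'].str.strip().isin(AGGREGATES)
aggregates = toy[is_aggregate]
countries = toy[~is_aggregate]

print(aggregates['UNITS'].tolist())
print(countries['UNITS'].tolist())
```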


Next we need to simplify some of the entries so that each one appears only once. A clear example is 'VIETNAM', 'VIETNAM (PC+CV in 2019)' and 'VIETNAM, yearly only': since we are only looking at yearly data, those remarks add no value and can be scrubbed.
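A minimal sketch of such a clean-up is shown below. The function name `clean_unit` is hypothetical, and the rules assume that no country name itself contains a comma, parentheses or '***', which holds for the labels printed above but may not hold for other files.

```python
import re

def clean_unit(label):
    """Reduce a raw UNITS label to the bare country or region name."""
    label = label.split('***')[0]          # drop '*** AS OF 2013, ...' remarks
    label = label.split(',')[0]            # drop ', yearly only' style remarks
    label = re.sub(r'\(.*?\)', '', label)  # drop parenthesised notes
    return label.strip()                   # drop leading and trailing whitespace

# Labels copied from the list of units printed above.
examples = [
    'VIETNAM',
    'VIETNAM (PC+CV in 2019)',
    'VIETNAM, yearly only',
    'CANADA  ',
    'NETHERLANDS *** AS OF 2013,  FIGURES ONCE A YEAR ONLY',
    'SWEDEN, FIGURES ONCE A YEAR ONLY',
]

print(sorted({clean_unit(e) for e in examples}))
```

Applied to a data frame, the same function could be passed to `pdf['UNITS'].map(clean_unit)`.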

True