# CSI4142-A Fundamentals of Data Science

Group 48

Data mining: data transformations for Economy dimension

Starting with the economy dimension (economy.csv), we will perform the following transformations: <br>
1. The "GDP growth rate" column contains some zero values as placeholders. For each of the zero values, we wish to replace it with the mean value of the industry sector. That is:<br>
a. For each row with a zero (0.0) value in the "GDP growth rate" column, we get the corresponding value in the "Sector" column.<br>
b. Find the mean GDP growth rate for that sector, computed over all rows whose "Sector" value matches the row concerned and whose growth rate is not a zero placeholder.<br>
c. Fill the zero value in the GDP growth rate with the mean growth rate from the sector.<br>
2. The "Sector" column is a categorical attribute. One-hot-encode this column into the five values it contains: All industries, All transportation, Air transportation, Ground transportation, Water Transportation. Replace the "Sector" column with five indicator columns, one per value (pandas names each one with a "Sector_" prefix by default).<br>
3. Drop the "year" column.<br>
4. Use a MinMax scaler to normalize the non-key attribute columns (GDP, GDP per capita, GDP growth rate) to the [0, 1] range.<br>
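
Step 1 can be illustrated on a small example. The sketch below is a vectorized alternative to a per-sector loop, using `groupby` and `transform`. The `toy` frame and its values are hypothetical and only stand in for economy.csv. It assumes that 0.0 always marks a missing value, so zeros are excluded when the sector mean is computed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real economy.csv has more rows and columns.
toy = pd.DataFrame({
    "Sector": ["Air transportation", "Air transportation", "Air transportation",
               "Water transportation", "Water transportation"],
    "GDP growth rate": [2.0, 0.0, 4.0, 0.0, 1.5],
})

# Treat 0.0 as missing so placeholders do not pull the sector mean down.
rate = toy["GDP growth rate"].replace(0.0, np.nan)

# Mean of the non-placeholder values, broadcast back to every row of the sector.
sector_mean = rate.groupby(toy["Sector"]).transform("mean")

# Fill only the placeholder rows with their sector mean.
toy["GDP growth rate"] = rate.fillna(sector_mean)
print(toy)
```

In this toy frame the Air transportation placeholder becomes 3.0 (the mean of 2.0 and 4.0) and the Water transportation placeholder becomes 1.5.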


In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the dataset
economy_df = pd.read_csv("https://raw.githubusercontent.com/noobstang/cscsi4142-project-datasets/master/dimension/economy.csv")

# Replace 0.0 values in GDP growth rate with the mean of the sector
for sector in economy_df['Sector'].unique():
    sector_mean = economy_df.loc[(economy_df['Sector'] == sector) & (economy_df['GDP growth rate'] != 0), 'GDP growth rate'].mean()
    economy_df.loc[(economy_df['Sector'] == sector) & (economy_df['GDP growth rate'] == 0), 'GDP growth rate'] = sector_mean


In [2]:
# One-hot-encode the "Sector" column
economy_df = pd.get_dummies(economy_df, columns=['Sector'])
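
By default, `pd.get_dummies` prefixes each new column with the source column name (for example `Sector_Air transportation`) and, in recent pandas versions, produces boolean values. If column names identical to the category values, or 0/1 integers, are preferred, one possible variant is sketched below. The `toy` frame is hypothetical and this is not what the notebook itself runs.

```python
import pandas as pd

# Hypothetical toy frame standing in for economy_df.
toy = pd.DataFrame({
    "Economy_key": [1, 2, 3],
    "Sector": ["Air transportation", "All industries", "Air transportation"],
})

# prefix="" and prefix_sep="" drop the default "Sector_" prefix;
# dtype=int yields 0/1 indicators instead of True/False.
encoded = pd.get_dummies(toy, columns=["Sector"], prefix="", prefix_sep="", dtype=int)
print(encoded)
```

Keeping the default prefix is the safer choice when several categorical columns are encoded, since it avoids name collisions between them.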


In [3]:
# Drop the "year" column
economy_df.drop('year', axis=1, inplace=True)


In [4]:
# Columns to scale
columns_to_scale = ['GDP', 'GDP per capita', 'GDP growth rate']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Scale the columns
economy_df[columns_to_scale] = scaler.fit_transform(economy_df[columns_to_scale])

# Check the transformed dataframe
print(economy_df.head())


   Economy_key  Location_key  Date_key       GDP  GDP per capita  \
0            1            76      2558  0.000235        0.002230   
1            2            77      2923  0.000246        0.002278   
2            3            78      3288  0.000252        0.002292   
3            4            79      3653  0.000260        0.002318   
4            5            80      4019  0.000233        0.002040   

   GDP growth rate  Sector_Air transportation  Sector_All industries  \
0         0.274288                       True                  False   
1         0.264410                       True                  False   
2         0.257774                       True                  False   
3         0.259111                       True                  False   
4         0.218176                       True                  False   

   Sector_All transportation  Sector_Ground transportation  \
0                      False                         False   
1                      False      

In [6]:
economy_df.tail()

| | Economy_key | Location_key | Date_key | GDP | GDP per capita | GDP growth rate | Sector_Air transportation | Sector_All industries | Sector_All transportation | Sector_Ground transportation | Sector_Water Transportation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 645 | 646 | 165 | 10228 | 0.000138 | 0.00044 | 0.291989 | False | False | False | False | True |
| 646 | 647 | 166 | 10593 | 0.000159 | 0.000501 | 0.297169 | False | False | False | False | True |
| 647 | 648 | 167 | 10958 | 0.000187 | 0.000585 | 0.304237 | False | False | False | False | True |
| 648 | 649 | 168 | 11324 | 0.000204 | 0.000637 | 0.278163 | False | False | False | False | True |
| 649 | 650 | 169 | 11689 | 0.000229 | 0.000706 | 0.287168 | False | False | False | False | True |
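
For reference, `MinMaxScaler` with its default `feature_range=(0, 1)` rescales each column independently as (x - min) / (max - min), so the smallest value in a column maps to 0 and the largest to 1. The sketch below checks this against a manual computation on a hypothetical toy frame; the numbers are made up and are not taken from economy.csv.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, chosen only to make the arithmetic easy to follow.
toy = pd.DataFrame({
    "GDP": [100.0, 250.0, 400.0],
    "GDP growth rate": [-1.0, 0.5, 2.0],
})

# Scaling done by scikit-learn (returns a NumPy array).
scaled = MinMaxScaler().fit_transform(toy)

# The same scaling written out column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Both approaches should agree up to floating-point error.
assert np.allclose(scaled, manual.to_numpy())
print(manual)
```

Because the minimum and maximum are taken per column, a few very large values in a column compress all the other values toward 0.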
