# CSI4142-A Fundamentals of Data Science

Group 48

Data mining: data transformations for the tourism fact table

For the tourism fact table (tourism_fact_table.csv), we will perform the following transformations. <br>
1. Drop all rows with an empty value in the "Economy_key" column. These entries cannot be used for data mining related to economic statistics. <br>
2. Use scikit-learn's MinMaxScaler to normalize the non-key numeric columns (Total non-resident tourists, United states tourists, Non-US foreign tourists, Canadian tourists returning from U.S., Canadian tourists returning from abroad) to the range [0, 1]. <br>
3. Move the rows with an empty value in the "Weather_key" column out of the main data frame and into a separate data frame. <br>
4. In the separated data frame (with empty "Weather_key" values), drop the "Weather_key" column, since it is empty for every row there. <br>

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the dataset
tourism_df = pd.read_csv("https://raw.githubusercontent.com/noobstang/cscsi4142-project-datasets/master/dimension/tourism_fact_table.csv")

# Drop rows with an empty 'Economy_key'
tourism_df.dropna(subset=['Economy_key'], inplace=True)


In [2]:
# Columns to scale
columns_to_scale = ['Total non-resident tourists', 'United states tourists', 'Non-US foreign tourists',
                    'Canadian tourists returning from U.S.', 'Canadian tourists returning from abroad']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Scale the columns
tourism_df[columns_to_scale] = scaler.fit_transform(tourism_df[columns_to_scale])
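
With its default settings, MinMaxScaler rescales each column independently to [0, 1] using (x - min) / (max - min). The following is a minimal pure-Python sketch of that formula, not the library's implementation; the helper `min_max_scale` and the toy values are hypothetical and only serve to illustrate the arithmetic.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to rescale, so this sketch
        # maps every value to 0.0 (an assumption made for illustration).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical tourist counts, not taken from the real dataset.
toy_counts = [100, 150, 300]
scaled = min_max_scale(toy_counts)
print(scaled)  # [0.0, 0.25, 1.0]
```

Note that in the cell above the minimum and maximum are computed over all rows that still have an Economy_key, including the rows that are later separated for lacking a Weather_key.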


In [3]:
# Separate rows with empty 'Weather_key'
separated_df = tourism_df[tourism_df['Weather_key'].isna()].copy()

# Remove these rows from the original dataframe
tourism_df.dropna(subset=['Weather_key'], inplace=True)


In [4]:
# Drop 'Weather_key' column from the separated dataframe
separated_df.drop('Weather_key', axis=1, inplace=True)


In [5]:
tourism_df.head()

Unnamed: 0,Date_key,Weather_key,Location_key,Economy_key,Total non-resident tourists,United states tourists,Non-US foreign tourists,Canadian tourists returning from U.S.,Canadian tourists returning from abroad,Seasonally adjusted
864,2558,"[2941, 3349, 3757]",42,27.0,0.049318,0.040315,0.05188,0.061146,0.061607,False
865,2558,"[2941, 3349, 3757]",42,157.0,0.059266,0.048869,0.06113,0.063407,0.045948,False
866,2558,"[2941, 3349, 3757]",42,287.0,0.071036,0.055535,0.081346,0.095211,0.05808,False
867,2558,"[2941, 3349, 3757]",42,417.0,0.074899,0.060824,0.079664,0.076338,0.04544,False
868,2558,"[2941, 3349, 3757]",42,547.0,0.12387,0.103334,0.124125,0.089854,0.038556,False


In [6]:
separated_df.head()

Unnamed: 0,Date_key,Location_key,Economy_key,Total non-resident tourists,United states tourists,Non-US foreign tourists,Canadian tourists returning from U.S.,Canadian tourists returning from abroad,Seasonally adjusted
840,2558,8,53.0,0.185591,0.167687,0.15126,0.31321,0.316182,False
841,2558,8,183.0,0.22149,0.200297,0.179973,0.298478,0.261216,False
842,2558,8,313.0,0.262414,0.234522,0.220619,0.502022,0.335175,False
843,2558,8,443.0,0.275647,0.248415,0.226179,0.417894,0.235661,False
844,2558,8,573.0,0.458446,0.415551,0.369474,0.419975,0.193416,False


1. Dropping rows missing Economy_key: this keeps only the entries that can be linked to economic statistics. <br>
2. Normalization: the listed numeric columns are rescaled to [0, 1], so columns with very different magnitudes become comparable for statistical or machine learning models. <br>
3. Handling rows missing Weather_key: entries without weather data are copied to a separate DataFrame and then removed from the original one. <br>
4. Cleaning the separated DataFrame: the Weather_key column is dropped from the separated DataFrame because it is empty for every row there. <br>
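
The split on Weather_key can be sanity-checked on a small example. The sketch below uses a made-up toy frame (the names `toy_df`, `with_weather`, and `without_weather` are hypothetical) and checks that the two parts partition the original rows; the same assertions could be adapted to `tourism_df` and `separated_df`.

```python
import pandas as pd

# Toy stand-in for the fact table; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Weather_key": ["[1, 2]", None, "[3]", None],
    "Economy_key": [27.0, 53.0, 157.0, 183.0],
})

# Boolean mask marking rows that have no weather key.
missing_mask = toy_df["Weather_key"].isna()

# Rows that keep their weather key stay in the main frame.
with_weather = toy_df[~missing_mask].copy()

# Rows without a weather key go to a separate frame, minus the empty column.
without_weather = toy_df[missing_mask].drop(columns="Weather_key")

# The two parts should account for every original row exactly once.
assert len(with_weather) + len(without_weather) == len(toy_df)
assert with_weather["Weather_key"].notna().all()
assert "Weather_key" not in without_weather.columns
```

Building the mask once and using it for both halves avoids the two selections drifting apart if the condition is edited later.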