# Pre-processing and Training Data Development

First, I load the libraries I need along with the revised CSV data. For the rest of this notebook I focus only on the second (bipolar disorder, df2) and fourth (eating disorder, df4) dataframes, since they are the only two relevant to my problem statement.

In [1]:
#Loading all libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [2]:
#Importing data
df1 = pd.read_csv('cleaned_data(1).csv') 
df2 = pd.read_csv('cleaned_data(2).csv')
df3 = pd.read_csv('cleaned_data(3).csv')
df4 = pd.read_csv('cleaned_data(4).csv')
df5 = pd.read_csv('cleaned_data(5).csv')

After loading the CSV files, I check the two dataframes I will need for the modeling step of the capstone.

In [3]:
df2.head()

Unnamed: 0,index,Country,Code,Year,Percentage of Prevalence (Bipolar Disorders(M)),Percentage of Prevalence (Bipolar Disorders(F)),Population (historical estimates)
0,1,Afghanistan,AFG,1990,0.675452,0.762342,12412311.0
1,2,Afghanistan,AFG,1991,0.674992,0.762142,13299016.0
2,3,Afghanistan,AFG,1992,0.674579,0.761958,14485543.0
3,4,Afghanistan,AFG,1993,0.674206,0.761774,15816601.0
4,5,Afghanistan,AFG,1994,0.673876,0.761599,17075728.0


In [4]:
df4.head()

Unnamed: 0,index,Country,Code,Year,Percentage of Prevalence (Eating disorders(M)),Percentage of Prevalence (Eating disorders(F)),Population (historical estimates)
0,1,Afghanistan,AFG,1990,0.091421,0.164942,12412311.0
1,2,Afghanistan,AFG,1991,0.088841,0.15985,13299016.0
2,3,Afghanistan,AFG,1992,0.086286,0.15523,14485543.0
3,4,Afghanistan,AFG,1993,0.084179,0.150636,15816601.0
4,5,Afghanistan,AFG,1994,0.081881,0.146573,17075728.0


I see that the only categorical data I need is 'Country', so I make a new dataframe of dummy variables for it. I leave out the 'Code' column, since it carries essentially the same information, and focus only on the 'Country' column.

In [5]:
dummy = pd.get_dummies(df2['Country'])
dummy.head()

Unnamed: 0,Afghanistan,Africa,African Region (WHO),Akrotiri and Dhekelia,Albania,Algeria,American Samoa,Andorra,Angola,Anguilla,...,World Bank Upper Middle Income,Wuerttemburg,Yemen,Yemen Arab Republic,Yemen People's Republic,Yugoslavia,Zambia,Zanzibar,Zimbabwe,Åland Islands
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
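
To make the encoding concrete, here is a minimal sketch on a made-up frame. The three toy rows and the `dtype=int` argument are my own assumptions for illustration, not values or settings taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the 'Country' column (illustrative values only)
toy = pd.DataFrame({"Country": ["Albania", "Zambia", "Albania"]})

# One indicator column per distinct country, 1 where the row matches
toy_dummy = pd.get_dummies(toy["Country"], dtype=int)
print(toy_dummy)
```

Each distinct country becomes its own column, which is why the real dummy frame above is so wide.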


In [6]:
dummy_2 = pd.get_dummies(df4['Country'])
dummy_2.head()

Unnamed: 0,Afghanistan,Africa,African Region (WHO),Akrotiri and Dhekelia,Albania,Algeria,American Samoa,Andorra,Angola,Anguilla,...,World Bank Upper Middle Income,Wuerttemburg,Yemen,Yemen Arab Republic,Yemen People's Republic,Yugoslavia,Zambia,Zanzibar,Zimbabwe,Åland Islands
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Once both are made, I use "concat" to merge each table with its dummy dataframe.

In [7]:
df2 = pd.concat([df2, dummy], axis=1)
df2.head()

Unnamed: 0,index,Country,Code,Year,Percentage of Prevalence (Bipolar Disorders(M)),Percentage of Prevalence (Bipolar Disorders(F)),Population (historical estimates),Afghanistan,Africa,African Region (WHO),...,World Bank Upper Middle Income,Wuerttemburg,Yemen,Yemen Arab Republic,Yemen People's Republic,Yugoslavia,Zambia,Zanzibar,Zimbabwe,Åland Islands
0,1,Afghanistan,AFG,1990,0.675452,0.762342,12412311.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Afghanistan,AFG,1991,0.674992,0.762142,13299016.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Afghanistan,AFG,1992,0.674579,0.761958,14485543.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Afghanistan,AFG,1993,0.674206,0.761774,15816601.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Afghanistan,AFG,1994,0.673876,0.761599,17075728.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
df4 = pd.concat([df4, dummy_2], axis=1)
df4.head()

Unnamed: 0,index,Country,Code,Year,Percentage of Prevalence (Eating disorders(M)),Percentage of Prevalence (Eating disorders(F)),Population (historical estimates),Afghanistan,Africa,African Region (WHO),...,World Bank Upper Middle Income,Wuerttemburg,Yemen,Yemen Arab Republic,Yemen People's Republic,Yugoslavia,Zambia,Zanzibar,Zimbabwe,Åland Islands
0,1,Afghanistan,AFG,1990,0.091421,0.164942,12412311.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Afghanistan,AFG,1991,0.088841,0.15985,13299016.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Afghanistan,AFG,1992,0.086286,0.15523,14485543.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Afghanistan,AFG,1993,0.084179,0.150636,15816601.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Afghanistan,AFG,1994,0.081881,0.146573,17075728.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
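
As a small illustration of what the merge does, the sketch below repeats it on a two-row toy frame. The toy rows are assumptions, not data from the CSV files.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (illustrative values)
toy = pd.DataFrame({"Country": ["Albania", "Zambia"], "Year": [1990, 1990]})
toy_dummy = pd.get_dummies(toy["Country"], dtype=int)

# axis=1 places the dummy columns beside the original ones, matching rows by index
toy_merged = pd.concat([toy, toy_dummy], axis=1)
print(toy_merged.columns.tolist())
```

Because 'Country' itself is kept, the merged frame holds both the original text column and its indicator columns, just as the outputs above show.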


Now I will use StandardScaler to standardize the numerical features, which are the male and female percentage of prevalence for both eating and bipolar disorders, and the population.

In [9]:
categorical_features = ['Country']
numeric_features = ['Percentage of Prevalence (Eating disorders(M))', 'Percentage of Prevalence (Eating disorders(F))', 'Population (historical estimates)']

In [10]:
# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
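
To show what a preprocessor of this shape produces, here is a minimal sketch on a four-row toy frame. The toy values, the name `toy_preprocessor`, and the `sparse_threshold=0` setting (used only to get a plain dense array back) are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one categorical and one numeric column (illustrative values)
toy = pd.DataFrame({
    "Country": ["Albania", "Zambia", "Albania", "Yemen"],
    "Population (historical estimates)": [1.0e6, 4.0e6, 2.0e6, 5.0e6],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Population (historical estimates)"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
    ],
    sparse_threshold=0,  # always return a dense array in this sketch
)

toy_processed = toy_preprocessor.fit_transform(toy)

# First column: scaled population; remaining columns: one per country
print(toy_processed.shape)
print(toy_processed[:, 0].mean(), toy_processed[:, 0].std())
```

The scaled column ends up with mean close to 0 and standard deviation close to 1, and each row has exactly one 1 among the country columns.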

In [11]:
df4.columns

Index(['index', 'Country', 'Code', 'Year',
       'Percentage of Prevalence (Eating disorders(M))',
       'Percentage of Prevalence (Eating disorders(F))',
       'Population (historical estimates)', 'Afghanistan', 'Africa',
       'African Region (WHO)',
       ...
       'World Bank Upper Middle Income', 'Wuerttemburg', 'Yemen',
       'Yemen Arab Republic', 'Yemen People's Republic', 'Yugoslavia',
       'Zambia', 'Zanzibar', 'Zimbabwe', 'Åland Islands'],
      dtype='object', length=322)

In [15]:
df2.columns

Index(['index', 'Country', 'Code', 'Year',
       'Percentage of Prevalence (Bipolar Disorders(M))',
       'Percentage of Prevalence (Bipolar Disorders(F))',
       'Population (historical estimates)', 'Afghanistan', 'Africa',
       'African Region (WHO)',
       ...
       'World Bank Upper Middle Income', 'Wuerttemburg', 'Yemen',
       'Yemen Arab Republic', 'Yemen People's Republic', 'Yugoslavia',
       'Zambia', 'Zanzibar', 'Zimbabwe', 'Åland Islands'],
      dtype='object', length=322)

I now define my features (X) and target (y), split the data into training and testing sets, and preprocess both sets.

In [20]:
# Define features (X) and target (y)
X = df2.drop(['Percentage of Prevalence (Bipolar Disorders(M))',
              'Percentage of Prevalence (Bipolar Disorders(F))'], axis=1)
y = df2[['Percentage of Prevalence (Bipolar Disorders(M))',
          'Percentage of Prevalence (Bipolar Disorders(F))']]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline (preprocessing only at this stage)
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Preprocess training data
X_train_processed = pipeline.fit_transform(X_train)

# Preprocess testing data
X_test_processed = pipeline.transform(X_test)

Now I print my results.

In [23]:
print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

X_train:        index         Country Code  Year  Population (historical estimates)  \
9427    9428  Cayman Islands  CYM  2012                            58963.0   
40596  40597         Reunion  REU  1920                           173260.0   
39496  39497     Philippines  PHL  1842                          3193695.0   
54565  54566       Venezuela  VEN  1830                           893324.0   
39266  39267            Peru  PER  1871                          2653867.0   
...      ...             ...  ...   ...                                ...   
44732  44733       Singapore  SGP  1933                           631717.0   
54343  54344         Vatican  VAT  1897                              910.0   
38158  38159       Palestine  PSE  1800                           165944.0   
860      861         Algeria  DZA  1200                          1899989.0   
15795  15796         Estonia  EST  1847                           462351.0   

       Afghanistan  Africa  African Region (WHO)  Akro

In [22]:
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (45115, 320)
Shape of X_test: (11279, 320)
Shape of y_train: (45115, 2)
Shape of y_test: (11279, 2)
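
The shapes above are consistent with a 20% test split. The sketch below reproduces the arithmetic from the printed row counts, assuming (as I understand train_test_split to behave) that the test size is rounded up and the remaining rows go to training.

```python
import math

# Row counts taken from the printed shapes of X_train and X_test
n_samples = 45115 + 11279
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # test rows, rounded up
n_train = n_samples - n_test               # everything else is training

print(n_samples, n_train, n_test)
```

This gives 56394 rows in total, split into 45115 training rows and 11279 testing rows, matching the output.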


In [21]:
print("Summary statistics of X_train:")
print(X_train.describe())

print("\nSummary statistics of X_test:")
print(X_test.describe())

print("\nSummary statistics of y_train:")
print(y_train.describe())

print("\nSummary statistics of y_test:")
print(y_test.describe())

Summary statistics of X_train:
              index          Year  Population (historical estimates)  \
count  45115.000000  45115.000000                       4.454500e+04   
mean   28199.121290   1611.050272                       3.251739e+07   
std    16265.182914   1399.909403                       2.517726e+08   
min        1.000000 -10000.000000                       1.000000e+00   
25%    14135.500000   1832.000000                       1.333040e+05   
50%    28183.000000   1901.000000                       1.209646e+06   
75%    42304.500000   1966.000000                       5.394679e+06   
max    56394.000000   2021.000000                       7.874966e+09   

        Afghanistan        Africa  African Region (WHO)  \
count  45115.000000  45115.000000          45115.000000   
mean       0.004633      0.004322              0.000598   
std        0.067906      0.065603              0.024457   
min        0.000000      0.000000              0.000000   
25%        0.000000      
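
The count row above shows 44,545 populated values for 'Population (historical estimates)' against 45,115 rows, so some entries appear to be missing. The sketch below works out the gap and, on a made-up four-row column, illustrates how StandardScaler treats a missing value: as far as I know, it skips NaN when fitting and leaves it as NaN in the output. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Counts read from the describe() output above
rows_in_X_train = 45115
non_missing_population = 44545
missing_population = rows_in_X_train - non_missing_population
print("Missing population values in X_train:", missing_population)

# Toy column with one missing entry (illustrative values)
toy = pd.DataFrame({"Population (historical estimates)": [1.0, 2.0, np.nan, 3.0]})
toy_scaled = StandardScaler().fit_transform(toy)

# The NaN is still present after scaling
print(np.isnan(toy_scaled).sum())
```

If the later model cannot accept NaN, these rows would need to be imputed or dropped before fitting.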

Then I repeat the same steps with the eating disorder table.

In [24]:
# Define features (X) and target (y)
X = df4.drop(['Percentage of Prevalence (Eating disorders(M))',
       'Percentage of Prevalence (Eating disorders(F))'], axis=1)
y = df4[['Percentage of Prevalence (Eating disorders(M))',
       'Percentage of Prevalence (Eating disorders(F))']]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline (preprocessing only at this stage)
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Preprocess training data
X_train_processed = pipeline.fit_transform(X_train)

# Preprocess testing data
X_test_processed = pipeline.transform(X_test)
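
Since the cell above repeats the bipolar disorder steps with different column names, the shared logic could be wrapped in a function. This is only a sketch: `split_and_preprocess`, the toy frame, and its 'Prevalence (M)' and 'Prevalence (F)' column names are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def split_and_preprocess(df, target_cols, test_size=0.2, random_state=42):
    """Hypothetical helper: split, then fit scaling and encoding on training rows only."""
    X = df.drop(columns=target_cols)
    y = df[target_cols]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Same dtype-based column selection as in the cells above
    numerical = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
    categorical = X.select_dtypes(include=["object"]).columns.tolist()
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numerical),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ]
    )
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    return X_train_processed, X_test_processed, y_train, y_test


# Toy frame with made-up values, only to show the call
toy = pd.DataFrame({
    "Country": pd.Series(["Albania", "Zambia", "Albania", "Zambia", "Yemen"] * 2, dtype=object),
    "Year": pd.Series(list(range(1990, 2000)), dtype="int64"),
    "Prevalence (M)": [0.1 * i for i in range(10)],
    "Prevalence (F)": [0.2 * i for i in range(10)],
})
X_tr, X_te, y_tr, y_te = split_and_preprocess(toy, ["Prevalence (M)", "Prevalence (F)"])
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With a helper like this, the bipolar and eating disorder tables would differ only in the list of target columns passed in.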

In [25]:
print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

X_train:        index         Country Code  Year  Population (historical estimates)  \
9427    9428  Cayman Islands  CYM  2012                            58963.0   
40596  40597         Reunion  REU  1920                           173260.0   
39496  39497     Philippines  PHL  1842                          3193695.0   
54565  54566       Venezuela  VEN  1830                           893324.0   
39266  39267            Peru  PER  1871                          2653867.0   
...      ...             ...  ...   ...                                ...   
44732  44733       Singapore  SGP  1933                           631717.0   
54343  54344         Vatican  VAT  1897                              910.0   
38158  38159       Palestine  PSE  1800                           165944.0   
860      861         Algeria  DZA  1200                          1899989.0   
15795  15796         Estonia  EST  1847                           462351.0   

       Afghanistan  Africa  African Region (WHO)  Akro

In [26]:
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (45115, 320)
Shape of X_test: (11279, 320)
Shape of y_train: (45115, 2)
Shape of y_test: (11279, 2)


In [27]:
print("Summary statistics of X_train:")
print(X_train.describe())

print("\nSummary statistics of X_test:")
print(X_test.describe())

print("\nSummary statistics of y_train:")
print(y_train.describe())

print("\nSummary statistics of y_test:")
print(y_test.describe())

Summary statistics of X_train:
              index          Year  Population (historical estimates)  \
count  45115.000000  45115.000000                       4.454500e+04   
mean   28199.121290   1611.050272                       3.251739e+07   
std    16265.182914   1399.909403                       2.517726e+08   
min        1.000000 -10000.000000                       1.000000e+00   
25%    14135.500000   1832.000000                       1.333040e+05   
50%    28183.000000   1901.000000                       1.209646e+06   
75%    42304.500000   1966.000000                       5.394679e+06   
max    56394.000000   2021.000000                       7.874966e+09   

        Afghanistan        Africa  African Region (WHO)  \
count  45115.000000  45115.000000          45115.000000   
mean       0.004633      0.004322              0.000598   
std        0.067906      0.065603              0.024457   
min        0.000000      0.000000              0.000000   
25%        0.000000      

##### Now that preprocessing is done and the results are printed, I move on to the modeling step.