<a href="https://colab.research.google.com/github/Dansah2/Identifying-Age-Related-Conditions/blob/main/Preprocess_ICR_Identifying_Age_Related_Conditions_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ICR - Identifying Age-Related Conditions

Kaggle Dataset Download API Command:

kaggle competitions download -c icr-identify-age-related-conditions

Predict whether a subject has or has not been diagnosed with one of these conditions -- a binary classification problem.

##Project Outline:

1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data

4) Create and Train baseline Model

5) Save the Model

## Download the Dataset

1) Install required libraries

2) Import required libraries

3) Download data from Kaggle


#### Install Required Libraries

In [1]:
!pip install -q -U numpy

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.2/18.2 MB[0m [31m29.0 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.26.0 which is incompatible.
tensorflow 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 1.26.0 which is incompatible.[0m[31m
[0m

#### Import Required Libraries

In [1]:
# loading and handeling data
import numpy as np
import pandas as pd

# scaling data
from sklearn.preprocessing import StandardScaler

# downloading data
from google.colab import drive

#### Download Data From Kaggle


In [2]:
# Mount google drive to store Kaggle API for future use
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# make a directory for kaggle temporary instance location in Colab
! mkdir ~/.kaggle

In [4]:
# upload json fine to Google drive and copy the temporary location
!cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [5]:
# change the file permissions to read/write to the owner only
! chmod 600 ~/.kaggle/kaggle.json

In [6]:
# download the kaggle data
! kaggle competitions download -c icr-identify-age-related-conditions

Downloading icr-identify-age-related-conditions.zip to /content
  0% 0.00/150k [00:00<?, ?B/s]
100% 150k/150k [00:00<00:00, 72.5MB/s]


In [7]:
# unzip the data
! unzip icr-identify-age-related-conditions.zip

Archive:  icr-identify-age-related-conditions.zip
  inflating: greeks.csv              
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [8]:
# create a function to read the data into a dataframe

def read_function(csv_file):

    return pd.read_csv(csv_file)

raw_train = read_function('/content/train.csv')
raw_test = read_function('/content/test.csv')

##Preprocess and organize the data

1) Fill null values with the mean of that column

2) Drop the necessary rows / columns

3) Scale the data

4) Save the preprocessed data to Google Drive

###Fill null values with the mean of that column

In [9]:
#fill the null values with the average of that columns values
def fill_average(data_frame, columns_filled):
  for column in columns_filled:
      avg = data_frame[column].mean()
      data_frame.fillna(value=avg, inplace=True)
      print(f'Column {column} average: {avg}')

fill_average(raw_train, ['BQ', 'CB', 'CC', 'DU', 'EL', 'FC', 'FL', 'FS', 'GL'])

Column BQ average: 98.32873688509873
Column CB average: 77.17295015521911
Column CC average: 1.1635499699761689
Column DU average: 1.9593442513534824
Column EL average: 72.37800659903715
Column FC average: 71.38526558328216
Column FL average: 5.583758864943435
Column FS average: 0.7388662929824918
Column GL average: 8.676500164278929


### Drop the necessary rows / columns

In [10]:
# remove unnecessary columns
def drop_columns(data_frame, col_name: list):
  try:
    data_frame = data_frame.drop(columns=col_name)
    print(f'The remaining colums are: {data_frame.columns}')
  except:
    print(f'The column(s) have already been dropped, the remaining are: {data_frame.columns}')
  return data_frame

clean_train = drop_columns(raw_train, ["Id", "EJ"])

The remaining colums are: Index(['AB', 'AF', 'AH', 'AM', 'AR', 'AX', 'AY', 'AZ', 'BC', 'BD ', 'BN', 'BP',
       'BQ', 'BR', 'BZ', 'CB', 'CC', 'CD ', 'CF', 'CH', 'CL', 'CR', 'CS', 'CU',
       'CW ', 'DA', 'DE', 'DF', 'DH', 'DI', 'DL', 'DN', 'DU', 'DV', 'DY', 'EB',
       'EE', 'EG', 'EH', 'EL', 'EP', 'EU', 'FC', 'FD ', 'FE', 'FI', 'FL', 'FR',
       'FS', 'GB', 'GE', 'GF', 'GH', 'GI', 'GL', 'Class'],
      dtype='object')


###Scale the Data

In [16]:
def scale_data(data_frame, target):
  #instantiate scalar
  scaler = StandardScaler()

  X = data_frame.drop(columns=target)
  y = data_frame[target]

  #fit the scalar to the data
  data_frame = scaler.fit_transform(X)

  # Convert the scaled data back into a DataFrame
  scaled_dataframe = pd.DataFrame(data_frame, columns=X.columns)

  #add the target column by to the dataframe
  scaled_dataframe['Class'] = y

  #print out the column names to ensure all columns are there
  print(scaled_dataframe.columns)

  return scaled_dataframe

clean_train = scale_data(clean_train, 'Class')

Index(['AB', 'AF', 'AH', 'AM', 'AR', 'AX', 'AY', 'AZ', 'BC', 'BD ', 'BN', 'BP',
       'BQ', 'BR', 'BZ', 'CB', 'CC', 'CD ', 'CF', 'CH', 'CL', 'CR', 'CS', 'CU',
       'CW ', 'DA', 'DE', 'DF', 'DH', 'DI', 'DL', 'DN', 'DU', 'DV', 'DY', 'EB',
       'EE', 'EG', 'EH', 'EL', 'EP', 'EU', 'FC', 'FD ', 'FE', 'FI', 'FL', 'FR',
       'FS', 'GB', 'GE', 'GF', 'GH', 'GI', 'GL', 'Class'],
      dtype='object')


In [19]:
clean_train.head()

Unnamed: 0,AB,AF,AH,AM,AR,AX,AY,AZ,BC,BD,...,FL,FR,FS,GB,GE,GF,GH,GI,GL,Class
0,-0.572153,-0.170975,-0.261669,-0.237889,-0.189295,-1.900558,-0.083417,-0.173502,-0.038354,-0.405383,...,0.142031,-0.035806,-0.112683,-0.940094,-0.41026,-0.655511,-0.948991,0.531241,-0.783193,1
1,-0.709105,-1.097801,-0.261669,-0.028701,-0.189295,-0.750457,-0.083417,0.678919,-0.104787,0.048541,...,-0.448241,-0.060566,-0.029732,-1.14507,-0.41026,0.687893,-0.238862,-0.509218,1.217561,0
2,-0.015212,-0.377169,-0.261669,-0.094845,-0.189295,0.465662,-0.083417,0.519453,-0.104787,-0.071089,...,0.176114,-0.051023,0.080474,1.637944,-0.29921,-0.05185,-0.351743,-0.424754,-0.776181,0
3,-0.480851,0.138196,0.012347,0.547477,-0.189295,-0.72961,-0.083417,0.112088,-0.104787,-0.391109,...,0.044605,-0.060566,-0.079503,-0.219883,-0.342195,-0.650833,0.858232,1.101332,-0.779945,0
4,-0.206946,0.100517,-0.261669,-0.356885,-0.189295,-0.628845,-0.013229,-1.649292,1.445139,0.125327,...,0.212856,0.896815,-0.107943,-0.432313,0.09992,-0.318309,1.409422,-0.395228,-0.785365,1


In [20]:
raw_train.head()

Unnamed: 0,Id,AB,AF,AH,AM,AR,AX,AY,AZ,BC,...,FL,FR,FS,GB,GE,GF,GH,GI,GL,Class
0,000ff2bfdfe9,0.209377,3109.03329,85.200147,22.394407,8.138688,0.699861,0.025578,9.812214,5.555634,...,7.298162,1.73855,0.094822,11.339138,72.611063,2003.810319,22.136229,69.834944,0.120343,1
1,007255e47698,0.145282,978.76416,85.200147,36.968889,8.138688,3.63219,0.025578,13.51779,1.2299,...,0.173229,0.49706,0.568932,9.292698,72.611063,27981.56275,29.13543,32.131996,21.978,0
2,013f2bd269f5,0.47003,2635.10654,85.200147,32.360553,8.138688,6.73284,0.025578,12.82457,1.2299,...,7.70956,0.97556,1.198821,37.077772,88.609437,13676.95781,28.022851,35.192676,0.196941,0
3,043ac50845d5,0.252107,3819.65177,120.201618,77.112203,8.138688,3.685344,0.025578,11.053708,1.2299,...,6.122162,0.49706,0.284466,18.529584,82.416803,2094.262452,39.948656,90.493248,0.155829,0
4,044fb8a146ec,0.380297,3733.04844,85.200147,14.103738,8.138688,3.942255,0.05481,3.396778,102.15198,...,8.153058,48.50134,0.121914,16.408728,146.109943,8524.370502,45.381316,36.262628,0.096614,1


###Save the preprocessed data to Google Drive

In [21]:
# export the data
clean_train.to_csv('/content/drive/My Drive/ICR_Project/scaled_train_df.csv', index=False)
#raw_test.to_csv('/content/drive/My Drive/ICR_Project/test_df.csv', index=False)

#verify it was exported
df_verify = pd.read_csv('/content/drive/My Drive/ICR_Project/scaled_train_df.csv')
print(df_verify)

           AB        AF        AH        AM        AR        AX        AY  \
0   -0.572153 -0.170975 -0.261669 -0.237889 -0.189295 -1.900558 -0.083417   
1   -0.709105 -1.097801 -0.261669 -0.028701 -0.189295 -0.750457 -0.083417   
2   -0.015212 -0.377169 -0.261669 -0.094845 -0.189295  0.465662 -0.083417   
3   -0.480851  0.138196  0.012347  0.547477 -0.189295 -0.729610 -0.083417   
4   -0.206946  0.100517 -0.261669 -0.356885 -0.189295 -0.628845 -0.013229   
..        ...       ...       ...       ...       ...       ...       ...   
612 -0.699975 -0.161828  0.040232 -0.422762  0.275215 -0.802577  0.040875   
613 -0.088253  0.852755 -0.261669  0.108831  0.556117  0.170319 -0.082686   
614 -0.106514 -0.453742  0.090140  0.235206 -0.011673  0.990330 -0.083417   
615 -0.243466 -0.973904 -0.261669 -0.219353 -0.189295  0.955584 -0.083417   
616  0.012178 -0.360885  3.350987  1.048310 -0.189295 -0.920714  0.135921   

           AZ        BC       BD   ...        FL        FR        FS  \
0  