<a href="https://colab.research.google.com/github/Dansah2/Identifying_Age_Related_Conditions/blob/main/2_Preprocess_ICR_Identifying_Age_Related_Conditions_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ICR - Identifying Age-Related Conditions

Kaggle Dataset Download API Command:

kaggle competitions download -c icr-identify-age-related-conditions

Predict whether a subject has or has not been diagnosed with one of the age-related conditions covered by the competition -- a binary classification problem.

## Project Outline:

1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data

4) Create and train a baseline model

5) Save the Model

## Download the Dataset

1) Install required libraries

2) Import required libraries

3) Download data from Kaggle


#### Install Required Libraries

In [None]:
!pip install -q numpy==1.24.3
!pip install -q pandas==2.2.2

#### Import Required Libraries

In [None]:
# loading and handling data
import numpy as np
import pandas as pd

# scaling data
from sklearn.preprocessing import StandardScaler

# one hot encoding
from sklearn.preprocessing import OneHotEncoder

# downloading data
from google.colab import drive

#### Download Data From Kaggle


In [None]:
# mount Google Drive, where the Kaggle API token is stored for reuse
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# make the directory where the Kaggle CLI looks for its token in this Colab instance
# (-p avoids an error if the directory already exists)
! mkdir -p ~/.kaggle

In [None]:
# upload the json file to Google Drive, then copy it to the temporary location
!cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [None]:
# restrict the file permissions to read/write for the owner only
! chmod 600 ~/.kaggle/kaggle.json
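
The three shell cells above (create the directory, copy the token, restrict its permissions) could also be collapsed into one Python helper. The sketch below is only an illustration: `install_kaggle_token` is a hypothetical name, not part of the Kaggle CLI, and both paths are passed in as parameters.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source_json, kaggle_dir):
    """Copy a Kaggle API token into kaggle_dir, readable/writable by the owner only.

    Hypothetical helper mirroring the `mkdir`, `cp`, and `chmod 600` cells.
    """
    kaggle_dir = Path(kaggle_dir)
    # like `mkdir -p`: no error if the directory already exists
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(source_json, target)
    # 0o600: read/write for the owner, nothing for group or others
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

In this notebook it might be called as `install_kaggle_token('/content/drive/MyDrive/Kaggle_API/kaggle.json', Path.home() / '.kaggle')`, assuming the token sits at the Drive path used above.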

In [None]:
# download the kaggle data
! kaggle competitions download -c icr-identify-age-related-conditions

Downloading icr-identify-age-related-conditions.zip to /content
  0% 0.00/150k [00:00<?, ?B/s]
100% 150k/150k [00:00<00:00, 111MB/s]


In [None]:
# unzip the data
! unzip icr-identify-age-related-conditions.zip

Archive:  icr-identify-age-related-conditions.zip
  inflating: greeks.csv              
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [None]:
# create a function to read the data into a dataframe

def read_function(csv_file):

    return pd.read_csv(csv_file)

raw_train = read_function('/content/train.csv')

## Preprocess and organize the data

1) Fill null values with the mean of that column

2) Drop the unnecessary rows / columns

3) Save the preprocessed data to Google Drive
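
Before step 1, it can help to derive which columns actually contain nulls instead of listing them by hand. A minimal sketch, assuming only numeric columns should be mean-filled; `columns_with_nulls` is a hypothetical helper and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def columns_with_nulls(data_frame):
    """Return the names of numeric columns that contain at least one null."""
    numeric = data_frame.select_dtypes(include="number")
    null_counts = numeric.isnull().sum()
    return null_counts[null_counts > 0].index.tolist()


# toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [4.0, 5.0, 6.0],
    "C": ["x", None, "z"],  # non-numeric, so it is skipped
})
print(columns_with_nulls(toy))  # ['A']
```

On the real data, the returned list could be passed as the second argument of the fill function defined in the next step.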

### Fill null values with the mean of that column

In [None]:
# fill the null values in each listed column with that column's mean
def fill_average(data_frame, columns_filled):
    for column in columns_filled:
        avg = data_frame[column].mean()
        # fill only this column; a frame-wide fillna would write this
        # column's mean into the gaps of every other column as well
        data_frame[column] = data_frame[column].fillna(avg)
        print(f'Column {column} average: {avg}')

fill_average(raw_train, ['BQ', 'CB', 'CC', 'DU', 'EL', 'FC', 'FL', 'FS', 'GL'])

Column BQ average: 98.32873688509873
Column CB average: 77.17295015521911
Column CC average: 1.1635499699761689
Column DU average: 1.9593442513534824
Column EL average: 72.37800659903715
Column FC average: 71.38526558328216
Column FL average: 5.583758864943435
Column FS average: 0.7388662929824918
Column GL average: 8.676500164278929
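
Mean imputation is meant to fill each column's gaps with that column's own mean, so it is worth checking on a small example. A minimal sketch on a made-up toy frame (not the competition data): passing a Series of column means to `fillna` fills every column with its own value in one call.

```python
import numpy as np
import pandas as pd

# toy frame with one gap per column (values are made up)
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],    # mean of the observed values: 2.0
    "B": [10.0, 20.0, np.nan],  # mean of the observed values: 15.0
})

# a Series indexed by column name fills each column with its own mean
filled = toy.fillna(toy.mean(numeric_only=True))

# no nulls should remain after the fill
assert filled.isnull().sum().sum() == 0
print(filled)
```

The same `isnull().sum().sum() == 0` check could be run on `raw_train` after the fill step.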


### Drop the unnecessary rows / columns

In [None]:
# remove unnecessary columns
def drop_columns(data_frame, col_name: list):
    try:
        data_frame = data_frame.drop(columns=col_name)
        print(f'The remaining columns are: {data_frame.columns}')
    except KeyError:
        # raised by drop() when a listed column is not in the frame
        print(f'The column(s) have already been dropped, the remaining are: {data_frame.columns}')
    return data_frame

raw_train = drop_columns(raw_train, ["Id"])

The remaining columns are: Index(['AB', 'AF', 'AH', 'AM', 'AR', 'AX', 'AY', 'AZ', 'BC', 'BD ', 'BN', 'BP',
       'BQ', 'BR', 'BZ', 'CB', 'CC', 'CD ', 'CF', 'CH', 'CL', 'CR', 'CS', 'CU',
       'CW ', 'DA', 'DE', 'DF', 'DH', 'DI', 'DL', 'DN', 'DU', 'DV', 'DY', 'EB',
       'EE', 'EG', 'EH', 'EJ', 'EL', 'EP', 'EU', 'FC', 'FD ', 'FE', 'FI', 'FL',
       'FR', 'FS', 'GB', 'GE', 'GF', 'GH', 'GI', 'GL', 'Class'],
      dtype='object')
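
Re-running a drop cell is common in a notebook, so the drop needs to tolerate a column that is already gone. As an alternative to catching the exception, pandas' `DataFrame.drop` accepts `errors="ignore"`. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({"Id": ["a", "b"], "AB": [0.1, 0.2], "Class": [0, 1]})

once = toy.drop(columns=["Id"], errors="ignore")
# "Id" is already gone here; errors="ignore" makes the second call a no-op
twice = once.drop(columns=["Id"], errors="ignore")

print(list(twice.columns))  # ['AB', 'Class']
```

The trade-off is that a misspelled column name is silently ignored as well, so the explicit `KeyError` handling remains the safer choice when the column list is typed by hand.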


### Save the preprocessed data to Google Drive

In [None]:
# export the data; raw_train now holds the filled frame with the Id column dropped
clean_train = raw_train
clean_train.to_csv('/content/drive/My Drive/ucla.edu_folder/ICR_Project/non_encoded_train_df.csv', index=False)

# verify it was exported
df_verify = pd.read_csv('/content/drive/My Drive/ucla.edu_folder/ICR_Project/non_encoded_train_df.csv')
print(df_verify)

           AB          AF          AH          AM         AR        AX  \
0    0.209377  3109.03329   85.200147   22.394407   8.138688  0.699861   
1    0.145282   978.76416   85.200147   36.968889   8.138688  3.632190   
2    0.470030  2635.10654   85.200147   32.360553   8.138688  6.732840   
3    0.252107  3819.65177  120.201618   77.112203   8.138688  3.685344   
4    0.380297  3733.04844   85.200147   14.103738   8.138688  3.942255   
..        ...         ...         ...         ...        ...       ...   
612  0.149555  3130.05946  123.763599    9.513984  13.020852  3.499305   
613  0.435846  5462.03438   85.200147   46.551007  15.973224  5.979825   
614  0.427300  2459.10720  130.138587   55.355778  10.005552  8.070549   
615  0.363205  1263.53524   85.200147   23.685856   8.138688  7.981959   
616  0.482849  2672.53426  546.663930  112.006102   8.138688  3.198099   

           AY         AZ          BC         BD   ...         FL        FR  \
0    0.025578   9.812214    5.555