# **Eye Cancer Patients Jupyter Notebook**

**Section**: S20 <br>

**Group**: pandas Salle

### **Import the necessary libraries**

These libraries cover data processing (`numpy`, `pandas`) and data visualization (`matplotlib`, `seaborn`).


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### **Initializing the DataFrame**

Load the Eye Cancer Patients Dataset into a pandas DataFrame


In [3]:
eye_cancer_df = pd.read_csv('eye_cancer_patients.csv')
eye_cancer_df.head()

| | Patient_ID | Age | Gender | Cancer_Type | Laterality | Date_of_Diagnosis | Stage_at_Diagnosis | Treatment_Type | Surgery_Status | Radiation_Therapy | Chemotherapy | Outcome_Status | Survival_Time_Months | Genetic_Markers | Family_History | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PID00001 | 58 | F | Retinoblastoma | Left | 2019-01-25 | Stage IV | Radiation | False | 15 | 3 | Deceased | 85 | NaN | True | UK |
| 1 | PID00002 | 15 | Other | Retinoblastoma | Right | 2021-10-21 | Stage III | Chemotherapy | True | 69 | 6 | In Remission | 10 | NaN | True | Japan |
| 2 | PID00003 | 64 | M | Retinoblastoma | Bilateral | 2021-03-12 | Stage IV | Surgery | False | 47 | 6 | In Remission | 3 | BRAF Mutation | False | UK |
| 3 | PID00004 | 33 | M | Melanoma | Right | 2021-05-10 | Stage II | Radiation | True | 36 | 6 | Active | 40 | NaN | False | Canada |
| 4 | PID00005 | 8 | Other | Lymphoma | Left | 2019-11-24 | Stage I | Chemotherapy | False | 14 | 14 | In Remission | 26 | BRAF Mutation | True | USA |


### **Preparing to Clean up the Data**

Inspect the column names, non-null counts, and data types of the dataset


In [4]:
eye_cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Patient_ID            5000 non-null   object
 1   Age                   5000 non-null   int64 
 2   Gender                5000 non-null   object
 3   Cancer_Type           5000 non-null   object
 4   Laterality            5000 non-null   object
 5   Date_of_Diagnosis     5000 non-null   object
 6   Stage_at_Diagnosis    5000 non-null   object
 7   Treatment_Type        5000 non-null   object
 8   Surgery_Status        5000 non-null   bool  
 9   Radiation_Therapy     5000 non-null   int64 
 10  Chemotherapy          5000 non-null   int64 
 11  Outcome_Status        5000 non-null   object
 12  Survival_Time_Months  5000 non-null   int64 
 13  Genetic_Markers       2503 non-null   object
 14  Family_History        5000 non-null   bool  
 15  Country               5000 non-null   object

### **Identifying Inconsistencies**

First, identify the unique values in each Series.

This is especially useful for the categorical columns in our Eye Cancer dataset, where misspelled or inconsistently formatted labels would show up as extra unique values.


In [8]:
PatientID_df = eye_cancer_df['Patient_ID']
Gender_df = eye_cancer_df['Gender']
Cancer_df = eye_cancer_df['Cancer_Type']
Laterality_df = eye_cancer_df['Laterality']
Stage_df = eye_cancer_df['Stage_at_Diagnosis']
Treatment_df = eye_cancer_df['Treatment_Type']
Surgery_df = eye_cancer_df['Surgery_Status']
Outcome_df = eye_cancer_df['Outcome_Status']
Genetic_df = eye_cancer_df['Genetic_Markers']
Family_History_df = eye_cancer_df['Family_History']
Country_df = eye_cancer_df['Country']
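The per-column checks that follow can also be collected in a single pass. The snippet below is a minimal sketch, not the notebook's own method: `sample_df` is a stand-in built from the first five rows shown by `head()`, and `unique_values` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame built from the first rows shown by head(); it stands in for the full CSV.
sample_df = pd.DataFrame({
    "Gender": ["F", "Other", "M", "M", "Other"],
    "Cancer_Type": ["Retinoblastoma", "Retinoblastoma", "Retinoblastoma",
                    "Melanoma", "Lymphoma"],
    "Laterality": ["Left", "Right", "Bilateral", "Right", "Left"],
})

# Collect the unique values of every column, in order of first appearance.
unique_values = {col: sample_df[col].unique().tolist() for col in sample_df.columns}

for col, values in unique_values.items():
    print(f"{col} Series: {values}")
```

On the real data, the same loop could run over a list of the categorical column names of `eye_cancer_df`.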

#### `Patient_ID` variable

In [12]:
print("Patient_ID Series: ", PatientID_df.unique())

Patient_ID Series:  ['PID00001' 'PID00002' 'PID00003' ... 'PID04998' 'PID04999' 'PID05000']


#### `Gender` variable

In [6]:
print("Gender Series: ", Gender_df.unique())

Gender Series:  ['F' 'Other' 'M']


#### `Cancer_Type` variable

In [8]:
print("Cancer_Type Series:", Cancer_df.unique())

Cancer_Type Series: ['Retinoblastoma' 'Melanoma' 'Lymphoma']


#### `Laterality` variable

In [10]:
print("Laterality Series:", Laterality_df.unique())

Laterality Series: ['Left' 'Right' 'Bilateral']


#### `Stage_at_Diagnosis` variable

In [12]:
print("Stage_at_Diagnosis Series:", Stage_df.unique())

Stage_at_Diagnosis Series: ['Stage IV' 'Stage III' 'Stage II' 'Stage I']


#### `Treatment_Type` variable

In [14]:
print("Treatment_Type Series:", Treatment_df.unique())

Treatment_Type Series: ['Radiation' 'Chemotherapy' 'Surgery']


#### `Surgery_Status` variable

In [16]:
print("Surgery_Status Series:", Surgery_df.unique())

Surgery_Status Series: [False  True]


#### `Outcome_Status` variable

In [18]:
print("Outcome_Status Series:", Outcome_df.unique())

Outcome_Status Series: ['Deceased' 'In Remission' 'Active']


#### `Genetic_Markers` variable

In [6]:
print("Genetic_Markers Series:", Genetic_df.unique())

Genetic_Markers Series: [nan 'BRAF Mutation']


> As we can see, we will want to clean up the `Genetic_Markers` variable, since it contains `nan` values. In pandas, `nan` (Not a Number) marks a missing entry.
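To make the behavior of `nan` concrete, here is a small sketch on a toy Series that holds the `Genetic_Markers` values of the first five rows. The names `markers`, `counts_with_missing`, `n_missing`, and `n_equal_to_nan` are illustrative only.

```python
import numpy as np
import pandas as pd

# Genetic_Markers values from the first five rows shown by head().
markers = pd.Series([np.nan, np.nan, "BRAF Mutation", np.nan, "BRAF Mutation"],
                    name="Genetic_Markers")

# value_counts() skips missing entries unless dropna=False is passed.
counts_with_missing = markers.value_counts(dropna=False)
print(counts_with_missing)

# NaN never compares equal to anything, so isna() is the way to find it.
n_missing = int(markers.isna().sum())
n_equal_to_nan = int((markers == np.nan).sum())
print(n_missing, n_equal_to_nan)
```

The comparison with `==` finds nothing, which is why the notebook relies on `isnull()` (an alias of `isna()`) when counting missing values.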

#### `Family_History` variable

In [22]:
print("Family_History Series:", Family_History_df.unique())

Family_History Series: [ True False]


#### `Country` variable

In [24]:
print("Country Series:", Country_df.unique())

Country Series: ['UK' 'Japan' 'Canada' 'USA' 'Australia' 'Germany' 'South Africa' 'Brazil'
 'France' 'India']


> 🆗 In this section of the notebook, the only inconsistency we found is in the `Genetic_Markers` Series, which contains `nan` (Not a Number) values.

#### **Cleaning up the** `Genetic_Markers` **variable**
We want to remove the `nan` values so that we do not run into issues during Exploratory Data Analysis later.<br>
First, let us check whether any other variables contain `nan` values.

In [19]:
nan_variables = eye_cancer_df.columns[eye_cancer_df.isnull().any()].to_list()
print(nan_variables)

['Genetic_Markers']


> For this dataset, `Genetic_Markers` is the only variable that contains `nan` values, so we may proceed with our original plan.
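A variation on the check above reports how many values are missing in each column, not just which columns are affected. This is a sketch on a toy frame built from the first five rows; `sample_df`, `missing_per_column`, and `columns_with_missing` are names assumed for illustration.

```python
import numpy as np
import pandas as pd

# Three columns from the first five rows shown by head().
sample_df = pd.DataFrame({
    "Patient_ID": ["PID00001", "PID00002", "PID00003", "PID00004", "PID00005"],
    "Genetic_Markers": [np.nan, np.nan, "BRAF Mutation", np.nan, "BRAF Mutation"],
    "Family_History": [True, True, False, False, True],
})

# Count missing entries per column, then keep only the affected columns.
missing_per_column = sample_df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```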

#### **Do we drop or replace the** `nan` **values from** `Genetic_Markers` **variable?**
Dropping the rows with `nan` values in `Genetic_Markers` would be more convenient, but first let us check whether enough data would remain if we dropped them.<br>
So our task is to count how many rows contain `nan` values.

In [22]:
nan_count = Genetic_df.isnull().sum()

print("The count of NaN values in Genetic_Markers is:", nan_count)

The count of NaN values in Genetic_Markers is: 2497


> Let us check the shape of our pandas DataFrame: `eye_cancer_df` 

In [26]:
# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = eye_cancer_df.shape
print("Our Data frame has {} rows and {} columns.".format(n_rows, n_cols))

Our Data frame has 5000 rows and 16 columns.


> We know we have 5000 observations, and according to the count above, 2497 of those rows have a `nan` value in the `Genetic_Markers` variable. <br> <br>
> Therefore, we cannot drop these rows, since doing so would remove about half of our observations. <br> <br>
> Instead, let us replace these `nan` values with a new category, `None`.
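The "about half" figure can be checked with a little arithmetic. The sketch below hard-codes the two numbers reported above (2497 missing values, 5000 rows); `missing_share` and `rows_left_after_drop` are names introduced here.

```python
nan_count = 2497  # count of nan values reported for Genetic_Markers
n_rows = 5000     # number of rows reported by eye_cancer_df.shape

# Share of rows that would be lost, and how many rows would remain.
missing_share = nan_count / n_rows
rows_left_after_drop = n_rows - nan_count

print(f"Missing share: {missing_share:.2%}")
print(f"Rows left after dropping: {rows_left_after_drop}")
```

The 2503 remaining rows agree with the non-null count that `info()` reports for `Genetic_Markers`. On the DataFrame itself, `eye_cancer_df['Genetic_Markers'].isnull().mean()` should give the same share directly.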

In [27]:
eye_cancer_df.loc[eye_cancer_df['Genetic_Markers'].isnull(), 'Genetic_Markers'] = 'None'
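An equivalent way to write this replacement is `Series.fillna`. The snippet is a sketch on a toy frame with the `Genetic_Markers` values of the first five rows, not a change to the notebook's approach; `sample_df` is an assumed name.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "Genetic_Markers": [np.nan, np.nan, "BRAF Mutation", np.nan, "BRAF Mutation"],
})

# fillna() replaces only the missing entries; assign the result back to the column.
sample_df["Genetic_Markers"] = sample_df["Genetic_Markers"].fillna("None")

print(sample_df["Genetic_Markers"].tolist())
```

Note that the replacement is the string `'None'`, an ordinary category label, and not Python's `None` object, which pandas would again treat as missing.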

#### **Sanity Check**
Let us see if the `nan` values in the `Genetic_Markers` variable have been replaced with `None`

In [39]:
replaced = eye_cancer_df.loc[eye_cancer_df['Genetic_Markers'] == 'None', 'Genetic_Markers'].shape[0]
not_replaced = eye_cancer_df['Genetic_Markers'].isnull().sum()
print("The number of rows where nan values in the Genetic_Markers variable have been replaced with 'None' is: {} rows".format(replaced))
print("Meanwhile, the number of rows still containing nan values is: {} rows".format(not_replaced))

The number of rows where nan values in the Genetic_Markers variable have been replaced with 'None' is: 2497 rows
Meanwhile, the number of rows still containing nan values is: 0 rows


The `2497` rows that now hold `None` in the `Genetic_Markers` variable match the number of rows that originally held `nan` values, so the replacement was successful.
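The same sanity check can be expressed with `assert` statements, so that a failed replacement stops the notebook instead of relying on a reader to compare printed numbers. This is a sketch on a toy Series; `before`, `after`, and `expected_replacements` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Genetic_Markers values from the first five rows shown by head().
before = pd.Series([np.nan, np.nan, "BRAF Mutation", np.nan, "BRAF Mutation"],
                   name="Genetic_Markers")

# Record how many entries are missing before replacing them.
expected_replacements = int(before.isna().sum())
after = before.fillna("None")

# No missing values may remain, and the 'None' count must match the old nan count.
assert after.isna().sum() == 0
assert (after == "None").sum() == expected_replacements
# Entries that were not missing must be left untouched.
assert (after == "BRAF Mutation").sum() == before.notna().sum()

print("Sanity check passed:", expected_replacements, "values replaced")
```

The count comparison assumes the column held no literal `'None'` strings before the replacement, which is consistent with the unique values printed earlier.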

# **=========================================**
This is as far as I got, hahaha **-Justine**

In [25]:
eye_cancer_df = eye_cancer_df.drop(['Age', 'Laterality', 'Date_of_Diagnosis', 'Stage_at_Diagnosis', 'Surgery_Status', 
                                             'Radiation_Therapy', 'Chemotherapy', 'Country'], axis=1)
eye_cancer_df = eye_cancer_df.drop_duplicates()
eye_cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Patient_ID            5000 non-null   object
 1   Gender                5000 non-null   object
 2   Cancer_Type           5000 non-null   object
 3   Treatment_Type        5000 non-null   object
 4   Outcome_Status        5000 non-null   object
 5   Survival_Time_Months  5000 non-null   int64 
 6   Genetic_Markers       2503 non-null   object
 7   Family_History        5000 non-null   bool  
dtypes: bool(1), int64(1), object(6)
memory usage: 278.4+ KB
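The column drop above can also be written with the `columns=` keyword, and `duplicated()` shows how many rows `drop_duplicates()` would remove before it is applied. The following is a sketch under stated assumptions: `sample_df` is a toy frame, and its third row is a hypothetical duplicate added only to make the effect visible.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Patient_ID": ["PID00001", "PID00002", "PID00002"],  # third row: hypothetical duplicate
    "Age": [58, 15, 15],
    "Gender": ["F", "Other", "Other"],
    "Country": ["UK", "Japan", "Japan"],
})

# columns= states the intent directly and avoids the axis=1 argument.
columns_to_drop = ["Age", "Country"]
trimmed_df = sample_df.drop(columns=columns_to_drop)

# Count fully duplicated rows before removing them.
n_duplicates = int(trimmed_df.duplicated().sum())
deduplicated_df = trimmed_df.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduplicated_df)} remain.")
```

Resetting the index afterwards keeps the row labels contiguous, which is optional but makes later positional inspection easier.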


In [8]:
eye_cancer_df['Gender'] = eye_cancer_df['Gender'].str.strip()
eye_cancer_df['Cancer_Type'] = eye_cancer_df['Cancer_Type'].str.strip()
eye_cancer_df['Treatment_Type'] = eye_cancer_df['Treatment_Type'].str.strip()
eye_cancer_df['Outcome_Status'] = eye_cancer_df['Outcome_Status'].str.strip()
eye_cancer_df['Genetic_Markers'] = eye_cancer_df['Genetic_Markers'].str.strip()
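The five assignments above follow one pattern, so they can be written as a loop over the column names. In this sketch the stray spaces in `sample_df` are hypothetical, added to show what `str.strip()` removes, and `text_columns` is an assumed name.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "Gender": [" F", "Other ", "M"],  # hypothetical stray spaces
    "Genetic_Markers": [np.nan, " BRAF Mutation ", "BRAF Mutation"],
    "Survival_Time_Months": [85, 10, 3],
})

# Strip leading and trailing whitespace from each text column.
text_columns = ["Gender", "Genetic_Markers"]
for col in text_columns:
    sample_df[col] = sample_df[col].str.strip()

print(sample_df)
```

Missing values pass through `str.strip()` unchanged, so stripping does not have to wait until the `nan` values have been replaced.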

In [9]:
eye_cancer_df.info()
eye_cancer_df.describe()
eye_cancer_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Patient_ID            5000 non-null   object
 1   Age                   5000 non-null   int64 
 2   Gender                5000 non-null   object
 3   Cancer_Type           5000 non-null   object
 4   Laterality            5000 non-null   object
 5   Date_of_Diagnosis     5000 non-null   object
 6   Stage_at_Diagnosis    5000 non-null   object
 7   Treatment_Type        5000 non-null   object
 8   Surgery_Status        5000 non-null   bool  
 9   Radiation_Therapy     5000 non-null   int64 
 10  Chemotherapy          5000 non-null   int64 
 11  Outcome_Status        5000 non-null   object
 12  Survival_Time_Months  5000 non-null   int64 
 13  Genetic_Markers       2503 non-null   object
 14  Family_History        5000 non-null   bool  
 15  Country               5000 non-null   object

| | Patient_ID | Age | Gender | Cancer_Type | Laterality | Date_of_Diagnosis | Stage_at_Diagnosis | Treatment_Type | Surgery_Status | Radiation_Therapy | Chemotherapy | Outcome_Status | Survival_Time_Months | Genetic_Markers | Family_History | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PID00001 | 58 | F | Retinoblastoma | Left | 2019-01-25 | Stage IV | Radiation | False | 15 | 3 | Deceased | 85 | NaN | True | UK |
| 1 | PID00002 | 15 | Other | Retinoblastoma | Right | 2021-10-21 | Stage III | Chemotherapy | True | 69 | 6 | In Remission | 10 | NaN | True | Japan |
| 2 | PID00003 | 64 | M | Retinoblastoma | Bilateral | 2021-03-12 | Stage IV | Surgery | False | 47 | 6 | In Remission | 3 | BRAF Mutation | False | UK |
| 3 | PID00004 | 33 | M | Melanoma | Right | 2021-05-10 | Stage II | Radiation | True | 36 | 6 | Active | 40 | NaN | False | Canada |
| 4 | PID00005 | 8 | Other | Lymphoma | Left | 2019-11-24 | Stage I | Chemotherapy | False | 14 | 14 | In Remission | 26 | BRAF Mutation | True | USA |
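A notebook cell renders only the value of its last expression, which is why the output above shows `info()` (which prints on its own) and `head()`, but nothing from `describe()`. One way to see every result is to print each one explicitly. This is a sketch on a toy frame holding the `Survival_Time_Months` values of the first five rows; `summary` is an assumed name.

```python
import pandas as pd

# Survival_Time_Months values from the first five rows shown by head().
sample_df = pd.DataFrame({"Survival_Time_Months": [85, 10, 3, 40, 26]})

# Keep the summary in a variable and print it, so it is shown even mid-cell.
summary = sample_df.describe()
print(summary)
print(sample_df.head())
```

Inside Jupyter, `display(...)` can be used in place of `print(...)` to keep the rich table rendering.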
