---
## **TEHREEM ZUBAIR**
## **BYTEWISE FELLOWSHIP**
## **TASK 14**
---
## **DATA CLEANING AND PREPROCESSING WITH PANDAS**
Today, we will delve into the essential tasks of data cleaning and preprocessing using Pandas, a powerful Python library widely used for data manipulation and analysis.

### **Why Data Cleaning and Preprocessing?**
- Before we can effectively analyze data or build models, it's crucial to ensure our datasets are accurate, complete, and in a format suitable for analysis. 
- Real-world data often comes with inconsistencies, missing values, outliers, and other imperfections that can hinder analysis and modeling efforts. 
- Data cleaning and preprocessing address these issues, ensuring that our data is reliable and ready for further exploration.

### **Techniques in Data Cleaning and Preprocessing**
In this article we will work through the following techniques for data cleaning and preprocessing:

**1. Handling Missing Data:** Techniques such as dropping rows/columns with missing values, filling missing values with appropriate measures (like mean, median, or specific values), or using advanced imputation methods.

**2. Data Transformation:** Converting data types, standardizing or normalizing numerical data, and transforming categorical data into numerical formats suitable for analysis.

**3. Handling Duplicates:** Identifying and removing duplicate records, ensuring data integrity.

**4. Feature Engineering:** Creating new features from existing ones to enhance model performance.

**5. Text and Date Processing:** Cleaning and transforming text data, parsing and manipulating date-time formats.

**6. Encoding Categorical Variables:** Converting categorical data into numerical form for model compatibility.

By employing these techniques systematically, we ensure that our data is robust and well-prepared for analysis, enabling us to derive meaningful insights and build reliable predictive models.

---
### **IMPORTING LIBRARIES**

In [3]:
import pandas as pd
import numpy as np

---
### **CUSTOM DATASET CREATION**
- This task requires us to perform basic cleaning and preprocessing steps. For that, I have made a small custom dataset of hospital patients.
- The DataFrame contains missing values, string columns, and date columns stored as strings.

Now let's go through each of the given tasks:

In [4]:
data = {
    'PatientID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    'Name': ['Muhammad', 'Fatima', 'Ahmed', 'Aisha', 'Zainab', np.nan, 'Zara',
             'Ahmed Khan', 'Sara Ali', 'Bilal Ahmed', 'Zara Shah', 'Umar Farooq', 
             np.nan, 'Faisal Qureshi', 'Mariam Malik', 'Ali Raza'],
    'Age': [34, np.nan, 28, 45, 50, 29, 32, 25, np.nan, 27, 35, 40, 29, 22, 33, 30],
    'Gender': ['M', 'F', np.nan, 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'AdmissionDate': ['2023-01-15', '2023-02-20', '2023-03-10', np.nan, '2023-05-05', '2023-06-17', 
                      '2023-07-11', '2023-08-23', '2023-09-14', '2023-10-05', '2023-11-17', 
                      '2023-12-01', '2024-01-11', '2024-02-22', '2024-03-15', '2024-04-05'],
    'DischargeDate': ['2023-01-25', np.nan, '2023-03-20', '2023-04-01', '2023-05-15', '2023-06-27', 
                      '2023-07-21', '2023-09-02', '2023-09-24', '2023-10-15', '2023-11-27', 
                      '2023-12-11', np.nan, '2024-03-04', '2024-03-25', '2024-04-15'],
    'Diagnosis': ['Flu', 'Covid-19', 'Fracture', 'Flu', 'Covid-19', 'Fracture', 
                  np.nan, 'Dengue', 'Covid-19', 'Flu', 'Fracture', 'Covid-19', 
                  'Flu', 'Malaria', 'Dengue', 'Flu']
}

In [5]:
# convert data to dataframe
df = pd.DataFrame(data)

# display the first five rows
df.head()

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34.0,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,,F,2023-02-20,,Covid-19
2,3,Ahmed,28.0,,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45.0,M,,2023-04-01,Flu
4,5,Zainab,50.0,F,2023-05-05,2023-05-15,Covid-19


---
### **HANDLING MISSING VALUES**
To handle missing values, we first need to check where they occur in the DataFrame. Let's do that!

In [6]:
# Identify missing values in dataframe
df.isnull()

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,False,False,False,False,False,False,False
1,False,False,True,False,False,True,False
2,False,False,False,True,False,False,False
3,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False
5,False,True,False,False,False,False,False
6,False,False,False,False,False,False,True
7,False,False,False,False,False,False,False
8,False,False,True,False,False,False,False
9,False,False,False,False,False,False,False


In [7]:
# we can also see the total number of null values in dataframe
df.isnull().sum()

PatientID        0
Name             2
Age              2
Gender           1
AdmissionDate    1
DischargeDate    2
Diagnosis        1
dtype: int64
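
Raw counts are easier to judge when expressed as a share of the column. The following is a minimal sketch on a hypothetical miniature DataFrame (`toy` is not part of the notebook's data), showing how the same `isnull()` result can be turned into percentages.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the hospital data, used only for illustration
toy = pd.DataFrame({
    'PatientID': [1, 2, 3, 4],
    'Age': [34, np.nan, 28, 45],
    'Gender': ['M', 'F', np.nan, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing entries
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

A column with a very high share of missing values may be a candidate for removal, while a low share usually favours filling.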

#### **TECHNIQUES FOR HANDLING MISSING VALUES**
Now that we have identified the missing values in the DataFrame, we can take one of the following actions:

**1. Delete the column with missing values.**

**2. Delete the rows with missing values.**

**3. Interpolate missing values.**

**4. Fill missing values with a specific value.**

Now apply these methods to the dataframe one by one.

#### Delete column with missing value

In [8]:
# deleting missing values' columns
df_dropna_columns = df.dropna(axis = 1)
df_dropna_columns.head()

Unnamed: 0,PatientID
0,1
1,2
2,3
3,4
4,5


After dropping columns with missing values, every column except PatientID has been deleted, because each of the others contained at least one missing value.
This method is not recommended in our case because it removes almost all of the information we need.
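
A less drastic variant is to drop only the columns that are mostly empty. The sketch below uses the `thresh` parameter of `dropna()` on a hypothetical DataFrame (`toy` and its `Notes` column are made up for illustration); the threshold of 3 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Age' has one gap, 'Notes' is almost entirely empty
toy = pd.DataFrame({
    'PatientID': [1, 2, 3, 4],
    'Age': [34, np.nan, 28, 45],
    'Notes': [np.nan, np.nan, np.nan, 'ok'],
})

# Keep only the columns that have at least 3 non-missing values
kept = toy.dropna(axis=1, thresh=3)
print(kept.columns.tolist())
```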

#### Delete row with missing values

In [9]:
# Now let's drop every row that contains a missing value
df_dropna_rows = df.dropna()
df_dropna_rows

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34.0,M,2023-01-15,2023-01-25,Flu
4,5,Zainab,50.0,F,2023-05-05,2023-05-15,Covid-19
7,8,Ahmed Khan,25.0,F,2023-08-23,2023-09-02,Dengue
9,10,Bilal Ahmed,27.0,F,2023-10-05,2023-10-15,Flu
10,11,Zara Shah,35.0,M,2023-11-17,2023-11-27,Fracture
11,12,Umar Farooq,40.0,F,2023-12-01,2023-12-11,Covid-19
13,14,Faisal Qureshi,22.0,F,2024-02-22,2024-03-04,Malaria
14,15,Mariam Malik,33.0,M,2024-03-15,2024-03-25,Dengue
15,16,Ali Raza,30.0,F,2024-04-05,2024-04-15,Flu


We cannot use this operation either: deleting rows removes entire patient records from the DataFrame (7 of the 16 here), and we do not want that.
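
If rows must be dropped at all, the loss can be limited by checking only the columns that are truly required. This is a small sketch on hypothetical data, assuming for illustration that a record is only unusable when its Diagnosis is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical records: one missing name, one missing diagnosis
toy = pd.DataFrame({
    'PatientID': [1, 2, 3],
    'Name': ['Muhammad', np.nan, 'Ahmed'],
    'Diagnosis': ['Flu', 'Covid-19', np.nan],
})

# Only rows with a missing Diagnosis are dropped; the missing Name is tolerated
rows_with_diagnosis = toy.dropna(subset=['Diagnosis'])
print(rows_with_diagnosis)
```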

Let's try the interpolation technique to fill the missing values.

#### Interpolate the missing values
In Pandas, interpolation techniques are methods used to estimate or fill missing values in a DataFrame based on existing data points. 
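These techniques are particularly useful when dealing with numeric, time series, or ordered data where values have a logical order or sequence. The default method is linear interpolation, which places each missing value on a straight line between its nearest known neighbours. In our DataFrame only the Age column is numeric with gaps, so it is the only column that interpolation can fill.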

In [10]:
# Interpolate missing values (only meaningful for numerical columns, so apply it to Age)
df_interpolate = df.copy()
df_interpolate['Age'] = df_interpolate['Age'].interpolate()
print("\nDataFrame after interpolating missing values:")
df_interpolate.head()


DataFrame after interpolating missing values:




Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34.0,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,31.0,F,2023-02-20,,Covid-19
2,3,Ahmed,28.0,,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45.0,M,,2023-04-01,Flu
4,5,Zainab,50.0,F,2023-05-05,2023-05-15,Covid-19
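
To see what linear interpolation does with longer gaps, here is a short sketch on a hypothetical series of ages (the values are made up). A gap of two consecutive missing values is filled with evenly spaced steps between the surrounding known values.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one single gap and one double gap
ages = pd.Series([34, np.nan, 28, 45, np.nan, np.nan, 30], dtype='float64')

# Linear interpolation draws a straight line between the known neighbours
linear = ages.interpolate(method='linear')
print(linear.tolist())
```

Note that for patient ages the row order carries no real meaning, so interpolation is shown here as a technique rather than as a recommendation for this column.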


#### Fill missing values with a specific value
Another method is to fill the missing values with a value chosen by the user.
- We could fill the gaps with an arbitrary value, but that is not a good technique.
- A more realistic approach is to compute the most frequently occurring value (for categorical columns) or the mean value (for numeric columns) and use it to fill the missing entries.

**COLUMN: 'Name'**

For the Name column I have replaced the missing values with the placeholder 'Unknown'.

In [11]:
# we can fill the name column missing value with unknown
df.fillna({'Name': 'Unknown'}, inplace = True)

**COLUMN: 'Age'**

- For the age column I have first calculated the mean age from the dataframe.
- Then I have replaced the missing values with the mean age.

In [12]:
# for the Age column, fill with the mean age (truncated to a whole number)
age = int(df['Age'].mean())

df.fillna({'Age': age}, inplace = True)

**COLUMN: 'Gender', 'Diagnosis'**

For the Gender and Diagnosis columns I have counted the entries and replaced the missing values with the most frequent entry.

In [13]:
# Count the values in the Diagnosis column to find the most frequent one
df[['Diagnosis']].value_counts()

Diagnosis
Flu          5
Covid-19     4
Fracture     3
Dengue       2
Malaria      1
Name: count, dtype: int64

In [14]:
# fill Diagnosis missing values with Flu
df.fillna({'Diagnosis': 'Flu'}, inplace = True)

In [15]:
# count the gender
df['Gender'].value_counts()

Gender
F    8
M    7
Name: count, dtype: int64

In [16]:
# Fill missing with F
df.fillna({'Gender': 'F'}, inplace = True)
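
Instead of reading the most frequent value off the counts and typing it in by hand, it can be looked up with `mode()`. The sketch below runs on hypothetical data (`toy` is made up); when several values tie, `mode()` returns them sorted and this sketch simply takes the first.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns with one gap each
toy = pd.DataFrame({
    'Gender': ['M', 'F', np.nan, 'F'],
    'Diagnosis': ['Flu', np.nan, 'Flu', 'Dengue'],
})

for col in ['Gender', 'Diagnosis']:
    # mode() ignores missing values and returns the most frequent entries
    most_frequent = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
```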

**COLUMN: 'AdmissionDate', 'DischargeDate'**

To fill the missing values in the AdmissionDate and DischargeDate columns I have used the following approach:
1. First convert the AdmissionDate and DischargeDate columns to datetime format (at present they are stored with the object dtype).
2. Then define a custom function fill_missing_dates() that checks for missing values in the AdmissionDate and DischargeDate columns. If the AdmissionDate is missing, it is filled with the DischargeDate minus 5 days. If the DischargeDate is missing, it is filled with the AdmissionDate plus 5 days. The 5-day stay is an assumption. **For this part I took some basic guidance from ChatGPT on how to add and subtract days from a date.**
3. Then apply this custom function to each row of the DataFrame using the apply() function with axis=1 to process rows.

In [17]:
df['AdmissionDate'].dtype

dtype('O')

In [18]:
# Convert columns to datetime format
df['AdmissionDate'] = pd.to_datetime(df['AdmissionDate'], format='%Y-%m-%d')
df['DischargeDate'] = pd.to_datetime(df['DischargeDate'], format='%Y-%m-%d')


In [19]:
# Custom function to fill missing dates
def fill_missing_dates(row):
    if pd.isnull(row['AdmissionDate']) and not pd.isnull(row['DischargeDate']):
        row['AdmissionDate'] = row['DischargeDate'] - pd.Timedelta(days=5)
    if pd.isnull(row['DischargeDate']) and not pd.isnull(row['AdmissionDate']):
        row['DischargeDate'] = row['AdmissionDate'] + pd.Timedelta(days=5)
    return row

In [20]:
# Apply the custom function to each row
df = df.apply(fill_missing_dates, axis=1)
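
The same rule can also be written without a row-wise function. The following is a sketch of a vectorized alternative on hypothetical dates, not the method used in this notebook; it keeps the assumed 5-day stay and, like the function above, leaves a row untouched when both dates are missing.

```python
import pandas as pd

# Hypothetical rows: one missing admission date, one missing discharge date
toy = pd.DataFrame({
    'AdmissionDate': pd.to_datetime(['2023-01-15', None, '2023-02-20']),
    'DischargeDate': pd.to_datetime(['2023-01-25', '2023-04-01', None]),
})

stay = pd.Timedelta(days=5)  # assumed length of stay

# Compute both replacements from the original columns, then assign
admission = toy['AdmissionDate'].fillna(toy['DischargeDate'] - stay)
discharge = toy['DischargeDate'].fillna(toy['AdmissionDate'] + stay)
toy['AdmissionDate'] = admission
toy['DischargeDate'] = discharge

print(toy)
```

Column-wise operations like this tend to be faster than apply() with axis=1 on large DataFrames, and they preserve the datetime dtype of the columns.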

In [21]:
# Final dataframe
df.head()

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34.0,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,32.0,F,2023-02-20,2023-02-25,Covid-19
2,3,Ahmed,28.0,F,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45.0,M,2023-03-27,2023-04-01,Flu
4,5,Zainab,50.0,F,2023-05-05,2023-05-15,Covid-19


In [22]:
df.isnull().sum()

PatientID        0
Name             0
Age              0
Gender           0
AdmissionDate    0
DischargeDate    0
Diagnosis        0
dtype: int64

In [23]:
df.to_csv('Hospital_data.csv', index=False)

#### Forward filling
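Forward filling propagates the last valid observation forward along a column. This means that the missing values are filled with the most recent preceding non-missing value. Since all missing values in our DataFrame have already been filled, the output of this cell is identical to the input.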

In [24]:
df_c = df.copy()
# Fill missing values using forward fill (fillna(method='ffill') is deprecated)
df_ffill = df_c.ffill()
print("\nDataFrame after forward filling missing values:")
df_ffill.head()


DataFrame after forward filling missing values:




Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34.0,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,32.0,F,2023-02-20,2023-02-25,Covid-19
2,3,Ahmed,28.0,F,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45.0,M,2023-03-27,2023-04-01,Flu
4,5,Zainab,50.0,F,2023-05-05,2023-05-15,Covid-19


#### Backward Filling
Backward filling propagates the next valid observation backward along a column. This means that the missing values are filled with the next non-missing value.

In [25]:
# Fill missing values using backward fill (fillna(method='bfill') is deprecated)
df_bfill = df_c.bfill()
print("\nDataFrame after backward filling missing values:")
df_bfill.head()


DataFrame after backward filling missing values:




Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34.0,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,32.0,F,2023-02-20,2023-02-25,Covid-19
2,3,Ahmed,28.0,F,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45.0,M,2023-03-27,2023-04-01,Flu
4,5,Zainab,50.0,F,2023-05-05,2023-05-15,Covid-19
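
Because the DataFrame no longer contained gaps when these two cells ran, their effect is not visible in the outputs above. The following sketch on a hypothetical series (values made up) shows what each method does when gaps are present.

```python
import numpy as np
import pandas as pd

# Hypothetical diagnosis column with gaps in the middle and at the end
s = pd.Series(['Flu', np.nan, np.nan, 'Dengue', np.nan])

forward = s.ffill()   # each gap takes the previous known value
backward = s.bfill()  # each gap takes the next known value

print(forward.tolist())
print(backward.tolist())
```

A trailing gap has no following value, so backward fill leaves it missing; the same holds for a leading gap with forward fill.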


---
### **CORRECTING DATATYPES:**
Now that we have handled all the missing values, let's make sure every column is stored with the correct datatype.

First check the datatypes of all the columns in the DataFrame.

In [26]:
df.dtypes

PatientID                 int64
Name                     object
Age                     float64
Gender                   object
AdmissionDate    datetime64[ns]
DischargeDate    datetime64[ns]
Diagnosis                object
dtype: object

The correct datatypes are:
1. PatientID -> int64
2. Name -> object
3. Age -> int64
4. Gender -> object
5. AdmissionDate -> datetime64[ns]
6. DischargeDate -> datetime64[ns]
7. Diagnosis -> object

All columns are in correct format except for the age column.

In [27]:
# Convert columns to specified data types
df['Age'] = df['Age'].astype(int)

In [28]:
df.dtypes

PatientID                 int64
Name                     object
Age                       int64
Gender                   object
AdmissionDate    datetime64[ns]
DischargeDate    datetime64[ns]
Diagnosis                object
dtype: object
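
The conversion with astype(int) works here only because the missing ages were filled first; a column that still contains NaN cannot be cast to a plain integer type. As a sketch under that assumption, pandas' nullable integer dtype 'Int64' (capital I) can hold whole numbers and missing values together. The series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages that still contain a missing value
ages = pd.Series([34, np.nan, 28])

# 'Int64' is the nullable integer dtype; the gap becomes <NA> instead of raising
nullable = ages.astype('Int64')
print(nullable)
```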

---
### **APPLYING A FUNCTION TO TRANSFORM VALUES**
In Pandas, you can apply a function to transform values in a DataFrame using the apply() function. This allows you to modify data in a column or across multiple columns based on custom logic defined by a function. 

First let's make a copy of the DataFrame and work with it so that changes are not made to the original one.

In [29]:
df_copy = df.copy()

In [30]:
# Convert the age in years into age in months
df_copy['Age(month)'] = df_copy['Age'].apply(lambda x:x*12)
df_copy.head()

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis,Age(month)
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu,408
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19,384
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture,336
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu,540
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19,600
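
For simple arithmetic like this, apply() is not strictly needed: the same result can be obtained by operating on the whole column at once. A brief sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages in years
ages = pd.Series([34, 32, 28])

via_apply = ages.apply(lambda x: x * 12)  # calls the function once per element
vectorized = ages * 12                    # one operation on the whole column

print(vectorized.tolist())
```

apply() remains the right tool when the transformation involves custom logic that has no built-in column-wise equivalent.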


---
### **NORMALIZATION AND STANDARDIZATION**
Min-Max scaling and z-score normalization are two common techniques used to transform numerical data within a specific range or distribution. 

**MIN-MAX Scaling**

Min-Max scaling, also known as normalization, transforms the values of a numeric column to a common scale, typically between 0 and 1. It preserves the shape of the original distribution while compressing the range of values. The formula is (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1.

In [31]:
# Normalize the age using Min-Max Scaling
df_copy['Age_normalized'] = (df_copy['Age'] - df_copy['Age'].min()) / (df_copy['Age'].max() - df_copy['Age'].min())
df_copy.head()

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis,Age(month),Age_normalized
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu,408,0.428571
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19,384,0.357143
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture,336,0.214286
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu,540,0.821429
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19,600,1.0
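
The formula divides by max - min, which is zero when every value in the column is the same. The helper below is a hypothetical sketch (`min_max_scale` is not a pandas function) that wraps the calculation and returns zeros in that edge case; returning zeros is a design choice made here for illustration.

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Scale a numeric series to the range [0, 1]."""
    span = values.max() - values.min()
    if span == 0:
        # Constant column: avoid division by zero and map everything to 0
        return pd.Series(0.0, index=values.index)
    return (values - values.min()) / span

scaled = min_max_scale(pd.Series([22, 36, 50]))
constant = min_max_scale(pd.Series([30, 30, 30]))
print(scaled.tolist())
```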


**Z-score Normalization (Standardization)**

Z-score normalization, or standardization, transforms the values of a numeric column to have a mean of 0 and a standard deviation of 1. It adjusts the values so that they are centered around 0 and scaled relative to the standard deviation.

In [32]:
# Standardize the Age column (z-score normalization)
df_copy['AgeZScore'] = (df_copy['Age'] - df_copy['Age'].mean()) / df_copy['Age'].std()
df_copy.head()

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis,Age(month),Age_normalized,AgeZScore
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu,408,0.428571,0.182546
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19,384,0.357143,-0.095619
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture,336,0.214286,-0.651949
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu,540,0.821429,1.712452
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19,600,1.0,2.407864
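
One detail worth knowing: Series.std() in pandas computes the sample standard deviation (ddof=1) by default, whereas some other tools, such as NumPy's std, default to the population version (ddof=0). The sketch below compares both on hypothetical ages; which one is appropriate depends on the downstream use.

```python
import numpy as np
import pandas as pd

# Hypothetical ages
ages = pd.Series([22.0, 36.0, 50.0])

# pandas default: sample standard deviation (ddof=1)
z_sample = (ages - ages.mean()) / ages.std()

# population standard deviation (ddof=0)
z_population = (ages - ages.mean()) / ages.std(ddof=0)

print(z_sample.tolist())
print(z_population.round(4).tolist())
```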


---
### **DUPLICATE ROWS**
Handling duplicate rows in a DataFrame involves identifying and optionally removing them. Pandas provides a straightforward way to do this with the duplicated() and drop_duplicates() methods. Here's how you can use these methods:

In [33]:
df.duplicated().sum()

0

Since there are 0 duplicate rows in our DataFrame, no action is needed.
But to better understand how duplicate rows are handled, let's add some duplicate rows to the DataFrame.

In [34]:
# Adding some duplicate rows to understand how to remove them
df.loc[16] = df.loc[15]
df.loc[17] = df.loc[13]
df.loc[18] = df.loc[4]

In [35]:
df

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19
5,6,Unknown,29,F,2023-06-17,2023-06-27,Fracture
6,7,Zara,32,M,2023-07-11,2023-07-21,Flu
7,8,Ahmed Khan,25,F,2023-08-23,2023-09-02,Dengue
8,9,Sara Ali,32,M,2023-09-14,2023-09-24,Covid-19
9,10,Bilal Ahmed,27,F,2023-10-05,2023-10-15,Flu


In [36]:
df.duplicated().sum()

3

In [37]:
# Drop duplicate rows
df_dropped_duplicates = df.drop_duplicates()
print("\nDataFrame after dropping duplicate rows:")
df_dropped_duplicates


DataFrame after dropping duplicate rows:


Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19
5,6,Unknown,29,F,2023-06-17,2023-06-27,Fracture
6,7,Zara,32,M,2023-07-11,2023-07-21,Flu
7,8,Ahmed Khan,25,F,2023-08-23,2023-09-02,Dengue
8,9,Sara Ali,32,M,2023-09-14,2023-09-24,Covid-19
9,10,Bilal Ahmed,27,F,2023-10-05,2023-10-15,Flu


In [38]:
# Drop duplicate rows based on specific columns
df_dropped_duplicates_subset = df.drop_duplicates(subset=['Diagnosis'])
print("\nDataFrame after dropping duplicate rows based on 'Diagnosis' column:")
df_dropped_duplicates_subset


DataFrame after dropping duplicate rows based on 'Diagnosis' column:


Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture
7,8,Ahmed Khan,25,F,2023-08-23,2023-09-02,Dengue
13,14,Faisal Qureshi,22,F,2024-02-22,2024-03-04,Malaria


**Dropping duplicates based on the Diagnosis column is not a good option, because it keeps only one patient per diagnosis and discards all the others. Dropping based on PatientID is appropriate because each PatientID should be unique.**
So we drop the duplicate entries based on the PatientID column.

In [39]:
# Drop duplicate rows based on the PatientID column
# (.copy() avoids a SettingWithCopyWarning in later assignments)
df = df.drop_duplicates(subset=['PatientID']).copy()
print("\nDataFrame after dropping duplicate rows based on 'PatientID' column:")
df


DataFrame after dropping duplicate rows based on 'PatientID' column:


Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19
5,6,Unknown,29,F,2023-06-17,2023-06-27,Fracture
6,7,Zara,32,M,2023-07-11,2023-07-21,Flu
7,8,Ahmed Khan,25,F,2023-08-23,2023-09-02,Dengue
8,9,Sara Ali,32,M,2023-09-14,2023-09-24,Covid-19
9,10,Bilal Ahmed,27,F,2023-10-05,2023-10-15,Flu


In [40]:
df.duplicated().sum()

0
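
Before deleting anything, it can help to look at the duplicated records themselves, and to decide which copy to keep. The following sketch on hypothetical data uses the `keep` parameter, which both duplicated() and drop_duplicates() accept; keeping the last copy is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical records in which PatientID 2 appears twice
toy = pd.DataFrame({
    'PatientID': [1, 2, 2, 3],
    'Diagnosis': ['Flu', 'Dengue', 'Dengue', 'Flu'],
})

# keep=False marks every copy of a duplicated PatientID, not just the repeats
all_copies = toy[toy.duplicated(subset=['PatientID'], keep=False)]

# keep='last' retains the final occurrence; reset_index tidies the row labels
deduped = toy.drop_duplicates(subset=['PatientID'], keep='last').reset_index(drop=True)

print(all_copies)
print(deduped)
```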

---
### **WORKING WITH STRINGS**

**Convert the Name, Gender, and Diagnosis columns to lower case**

- I have converted the three columns at once by applying a function.

- Each column can also be converted individually using df['COLUMN NAME'].str.lower()

In [41]:
df[['Name', 'Gender', 'Diagnosis']] = df[['Name', 'Gender', 'Diagnosis']].apply(lambda x: x.str.lower())
df.head()



Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,muhammad,34,m,2023-01-15,2023-01-25,flu
1,2,fatima,32,f,2023-02-20,2023-02-25,covid-19
2,3,ahmed,28,f,2023-03-10,2023-03-20,fracture
3,4,aisha,45,m,2023-03-27,2023-04-01,flu
4,5,zainab,50,f,2023-05-05,2023-05-15,covid-19


**Remove leading and trailing spaces from string values in a column.**

In [42]:
df[['Name', 'Gender', 'Diagnosis']] = df[['Name', 'Gender', 'Diagnosis']].apply(lambda x: x.str.strip())
df.head()



Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,muhammad,34,m,2023-01-15,2023-01-25,flu
1,2,fatima,32,f,2023-02-20,2023-02-25,covid-19
2,3,ahmed,28,f,2023-03-10,2023-03-20,fracture
3,4,aisha,45,m,2023-03-27,2023-04-01,flu
4,5,zainab,50,f,2023-05-05,2023-05-15,covid-19


**Replace a specific substring in a column with another substring**

Let's just replace flu with influenza!

In [43]:
df['Diagnosis'] = df['Diagnosis'].str.replace('flu', 'influenza')
print("\nDataFrame after replacing 'flu' with 'influenza' in 'Diagnosis' column:")
df.head()


DataFrame after replacing 'flu' with 'influenza' in 'Diagnosis' column:




Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
0,1,muhammad,34,m,2023-01-15,2023-01-25,influenza
1,2,fatima,32,f,2023-02-20,2023-02-25,covid-19
2,3,ahmed,28,f,2023-03-10,2023-03-20,fracture
3,4,aisha,45,m,2023-03-27,2023-04-01,influenza
4,5,zainab,50,f,2023-05-05,2023-05-15,covid-19
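
A plain substring replacement also matches 'flu' inside longer words, so running it on a column that already contains 'influenza' would corrupt those entries. The sketch below, on a hypothetical series, contrasts the plain replacement with a regular expression that matches 'flu' only as a whole word.

```python
import pandas as pd

# Hypothetical diagnoses, one of which already says 'influenza'
diagnosis = pd.Series(['flu', 'influenza', 'covid-19'])

# Plain substring replacement also hits the 'flu' inside 'influenza'
naive = diagnosis.str.replace('flu', 'influenza', regex=False)

# \b marks a word boundary, so only the standalone word 'flu' is replaced
whole_word = diagnosis.str.replace(r'\bflu\b', 'influenza', regex=True)

print(naive.tolist())
print(whole_word.tolist())
```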


**Extract a substring from each value in a column**
- Here I am extracting the first three characters from the Diagnosis column.
- I do not want these changes to affect the original DataFrame, so I am using the copied DataFrame.

In [44]:
# Extract a substring from each value in a column
df_copy['Diagnosis(abbr)'] = df_copy['Diagnosis'].str[:3]
df_copy.head()

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis,Age(month),Age_normalized,AgeZScore,Diagnosis(abbr)
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu,408,0.428571,0.182546,Flu
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19,384,0.357143,-0.095619,Cov
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture,336,0.214286,-0.651949,Fra
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu,540,0.821429,1.712452,Flu
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19,600,1.0,2.407864,Cov


---
### **WORKING WITH DATETIME**

**Extract year, month, and day from a datetime column**

I do not want these changes to affect the original DataFrame, so I am using the copied DataFrame.

In [45]:
# Extract year, month, and day from a datetime column
df_copy['AdmissionYear'] = df_copy['AdmissionDate'].dt.year
df_copy['AdmissionMonth'] = df_copy['AdmissionDate'].dt.month
df_copy['AdmissionDay'] = df_copy['AdmissionDate'].dt.day
df_copy['AdmissionTime'] = df_copy['AdmissionDate'].dt.time

print("\nDataFrame after extracting year, month, and day from 'AdmissionDate':")
df_copy.head()


DataFrame after extracting year, month, and day from 'AdmissionDate':


Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis,Age(month),Age_normalized,AgeZScore,Diagnosis(abbr),AdmissionYear,AdmissionMonth,AdmissionDay,AdmissionTime
0,1,Muhammad,34,M,2023-01-15,2023-01-25,Flu,408,0.428571,0.182546,Flu,2023,1,15,00:00:00
1,2,Fatima,32,F,2023-02-20,2023-02-25,Covid-19,384,0.357143,-0.095619,Cov,2023,2,20,00:00:00
2,3,Ahmed,28,F,2023-03-10,2023-03-20,Fracture,336,0.214286,-0.651949,Fra,2023,3,10,00:00:00
3,4,Aisha,45,M,2023-03-27,2023-04-01,Flu,540,0.821429,1.712452,Flu,2023,3,27,00:00:00
4,5,Zainab,50,F,2023-05-05,2023-05-15,Covid-19,600,1.0,2.407864,Cov,2023,5,5,00:00:00


In [46]:
df_copy.drop(columns=['AdmissionDate'], inplace = True)

In [47]:
df_copy.head()

Unnamed: 0,PatientID,Name,Age,Gender,DischargeDate,Diagnosis,Age(month),Age_normalized,AgeZScore,Diagnosis(abbr),AdmissionYear,AdmissionMonth,AdmissionDay,AdmissionTime
0,1,Muhammad,34,M,2023-01-25,Flu,408,0.428571,0.182546,Flu,2023,1,15,00:00:00
1,2,Fatima,32,F,2023-02-25,Covid-19,384,0.357143,-0.095619,Cov,2023,2,20,00:00:00
2,3,Ahmed,28,F,2023-03-20,Fracture,336,0.214286,-0.651949,Fra,2023,3,10,00:00:00
3,4,Aisha,45,M,2023-04-01,Flu,540,0.821429,1.712452,Flu,2023,3,27,00:00:00
4,5,Zainab,50,F,2023-05-15,Covid-19,600,1.0,2.407864,Cov,2023,5,5,00:00:00


In [48]:
# Extract year, month, and day from a datetime column
df_copy['DischargeYear'] = df_copy['DischargeDate'].dt.year
df_copy['DischargeMonth'] = df_copy['DischargeDate'].dt.month
df_copy['DischargeDay'] = df_copy['DischargeDate'].dt.day
df_copy['DischargeTime'] = df_copy['DischargeDate'].dt.time

df_copy.head()

Unnamed: 0,PatientID,Name,Age,Gender,DischargeDate,Diagnosis,Age(month),Age_normalized,AgeZScore,Diagnosis(abbr),AdmissionYear,AdmissionMonth,AdmissionDay,AdmissionTime,DischargeYear,DischargeMonth,DischargeDay,DischargeTime
0,1,Muhammad,34,M,2023-01-25,Flu,408,0.428571,0.182546,Flu,2023,1,15,00:00:00,2023,1,25,00:00:00
1,2,Fatima,32,F,2023-02-25,Covid-19,384,0.357143,-0.095619,Cov,2023,2,20,00:00:00,2023,2,25,00:00:00
2,3,Ahmed,28,F,2023-03-20,Fracture,336,0.214286,-0.651949,Fra,2023,3,10,00:00:00,2023,3,20,00:00:00
3,4,Aisha,45,M,2023-04-01,Flu,540,0.821429,1.712452,Flu,2023,3,27,00:00:00,2023,4,1,00:00:00
4,5,Zainab,50,F,2023-05-15,Covid-19,600,1.0,2.407864,Cov,2023,5,5,00:00:00,2023,5,15,00:00:00


In [49]:
df_copy.drop(columns=['DischargeDate'], inplace = True)
df_copy.head()

Unnamed: 0,PatientID,Name,Age,Gender,Diagnosis,Age(month),Age_normalized,AgeZScore,Diagnosis(abbr),AdmissionYear,AdmissionMonth,AdmissionDay,AdmissionTime,DischargeYear,DischargeMonth,DischargeDay,DischargeTime
0,1,Muhammad,34,M,Flu,408,0.428571,0.182546,Flu,2023,1,15,00:00:00,2023,1,25,00:00:00
1,2,Fatima,32,F,Covid-19,384,0.357143,-0.095619,Cov,2023,2,20,00:00:00,2023,2,25,00:00:00
2,3,Ahmed,28,F,Fracture,336,0.214286,-0.651949,Fra,2023,3,10,00:00:00,2023,3,20,00:00:00
3,4,Aisha,45,M,Flu,540,0.821429,1.712452,Flu,2023,3,27,00:00:00,2023,4,1,00:00:00
4,5,Zainab,50,F,Covid-19,600,1.0,2.407864,Cov,2023,5,5,00:00:00,2023,5,15,00:00:00
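
Datetime columns can also be subtracted from each other, which is a simple way to engineer a new feature before the original date columns are dropped. This sketch computes a length of stay on hypothetical dates; the column name `StayDays` is made up for illustration.

```python
import pandas as pd

# Hypothetical admission and discharge dates
toy = pd.DataFrame({
    'AdmissionDate': pd.to_datetime(['2023-01-15', '2023-03-27']),
    'DischargeDate': pd.to_datetime(['2023-01-25', '2023-04-01']),
})

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
toy['StayDays'] = (toy['DischargeDate'] - toy['AdmissionDate']).dt.days
print(toy)
```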


**Filter rows based on a date range**

In [50]:
# Filter rows based on a date range
df_filtered_dates = df[(df['AdmissionDate'] >= '2023-03-01') & (df['AdmissionDate'] <= '2023-05-31')]
df_filtered_dates

Unnamed: 0,PatientID,Name,Age,Gender,AdmissionDate,DischargeDate,Diagnosis
2,3,ahmed,28,f,2023-03-10,2023-03-20,fracture
3,4,aisha,45,m,2023-03-27,2023-04-01,influenza
4,5,zainab,50,f,2023-05-05,2023-05-15,covid-19
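
The two comparisons joined with & can also be expressed with Series.between(), which includes both endpoints by default. A short sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical admission dates
dates = pd.Series(pd.to_datetime(['2023-02-20', '2023-03-10', '2023-05-05', '2023-06-17']))

# between() is inclusive on both ends by default
mask = dates.between(pd.Timestamp('2023-03-01'), pd.Timestamp('2023-05-31'))
in_range = dates[mask]
print(in_range)
```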


---
### **WORKING WITH CATEGORICAL DATA**
Working with categorical data in Pandas involves several common tasks such as converting categorical data to numerical format, grouping values, and creating new columns based on categories.
##### **ONE-HOT ENCODING**
**Use One-Hot Encoding When:**
1. Nominal Variables: The categorical variable is nominal (no inherent order or ranking). Examples include gender, color, and city names.
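2. Models That Read Codes as Magnitudes: Algorithms that treat numeric inputs as quantities or distances (e.g., linear models, k-nearest neighbors, and neural networks) generally need one-hot encoding for nominal variables, whereas tree-based models such as decision trees and random forests are usually less sensitive to the choice.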
3. Avoiding Ordinality: You want to avoid introducing any ordinal relationship between the categories, which could mislead the model.

##### **LABEL ENCODING:**
**Use Label Encoding When:**
1. Ordinal Variables: The categorical variable is ordinal (there is an inherent order or ranking). Examples include ratings (e.g., low, medium, high), education level, or experience level.
2. Linear Models: Algorithms that can handle ordinal relationships (e.g., linear regression, logistic regression, support vector machines) might benefit from label encoding if the categories have a meaningful order.

The categorical columns in our DataFrame are:
1. Gender
2. Diagnosis

Both are nominal (neither has an inherent order), so one-hot encoding is the better fit. To understand both methods, I apply one-hot encoding to the columns in df and label encoding to the Diagnosis column in df_copy.

In [51]:
# Convert a categorical column to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['Gender'], prefix='Gender')
df.head()

Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Diagnosis,Gender_f,Gender_m
0,1,muhammad,34,2023-01-15,2023-01-25,influenza,False,True
1,2,fatima,32,2023-02-20,2023-02-25,covid-19,True,False
2,3,ahmed,28,2023-03-10,2023-03-20,fracture,True,False
3,4,aisha,45,2023-03-27,2023-04-01,influenza,False,True
4,5,zainab,50,2023-05-05,2023-05-15,covid-19,True,False


In [53]:
# Convert a categorical column to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['Diagnosis'], prefix='')
df.head()

Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False


In [52]:
# Convert a categorical column to numerical using label encoding
df_copy['DiagnosisLabel'] = df_copy['Diagnosis'].astype('category').cat.codes
print("\nDataFrame after label encoding 'Diagnosis' column:")
df_copy.head()


DataFrame after label encoding 'Diagnosis' column:


Unnamed: 0,PatientID,Name,Age,Gender,Diagnosis,Age(month),Age_normalized,AgeZScore,Diagnosis(abbr),AdmissionYear,AdmissionMonth,AdmissionDay,AdmissionTime,DischargeYear,DischargeMonth,DischargeDay,DischargeTime,DiagnosisLabel
0,1,Muhammad,34,M,Flu,408,0.428571,0.182546,Flu,2023,1,15,00:00:00,2023,1,25,00:00:00,2
1,2,Fatima,32,F,Covid-19,384,0.357143,-0.095619,Cov,2023,2,20,00:00:00,2023,2,25,00:00:00,0
2,3,Ahmed,28,F,Fracture,336,0.214286,-0.651949,Fra,2023,3,10,00:00:00,2023,3,20,00:00:00,3
3,4,Aisha,45,M,Flu,540,0.821429,1.712452,Flu,2023,3,27,00:00:00,2023,4,1,00:00:00,2
4,5,Zainab,50,F,Covid-19,600,1.0,2.407864,Cov,2023,5,5,00:00:00,2023,5,15,00:00:00,0
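
With astype('category'), the codes follow the alphabetical order of the category names, which is arbitrary for a nominal column such as Diagnosis. For a genuinely ordinal variable, the order can be stated explicitly. The sketch below uses a hypothetical `severity` column that does not exist in our dataset.

```python
import pandas as pd

# Hypothetical ordinal variable
severity = pd.Series(['low', 'high', 'medium', 'low'])

# Listing the categories in order makes the codes follow the intended ranking
ordered = pd.Categorical(severity, categories=['low', 'medium', 'high'], ordered=True)
codes = pd.Series(ordered.codes)

print(codes.tolist())
```

Without the explicit category list, the alphabetical order would be high, low, medium, and the codes would no longer reflect the ranking.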


**Group the numeric Age column into bins and create a new column with the grouped categories**

In [54]:
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 30, 60, 90], labels=['Young', 'Middle-aged', 'Old'])
print("\nDataFrame after grouping 'Age' into categories:")
df.head()


DataFrame after grouping 'Age' into categories:


Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged
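
By default pd.cut uses right-closed intervals, so with the bins above an age of exactly 30 lands in Young and 60 lands in Middle-aged. The sketch below, on a few made-up boundary ages, shows how right=False shifts those edge cases:

```python
import pandas as pd

ages = pd.Series([30, 31, 60, 61])  # hypothetical boundary values
labels = ['Young', 'Middle-aged', 'Old']

# Default right=True: intervals are (0, 30], (30, 60], (60, 90]
groups = pd.cut(ages, bins=[0, 30, 60, 90], labels=labels)

# right=False: intervals are [0, 30), [30, 60), [60, 90)
groups_left = pd.cut(ages, bins=[0, 30, 60, 90], labels=labels, right=False)

print(pd.DataFrame({'Age': ages, 'right': groups, 'left': groups_left}))
```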


---
### **DATAFRAME CONCATENATION**
Pandas offers two main ways to combine DataFrames. pd.merge() joins them on a shared key column, while pd.concat() stacks two or more DataFrames along a particular axis (either rows or columns). Both are commonly used to bring together datasets with similar or complementary data.

In [55]:
# Merge two DataFrames based on a common column
df_additional = pd.DataFrame({
    'PatientID': [1, 2, 3, 4, 5, 6],
    'BloodType': ['A', 'B', 'O', 'AB', 'A', 'B']
})
df_merged = pd.merge(df, df_additional, on='PatientID')
print("\nMerged DataFrame:")
df_merged


Merged DataFrame:


Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,BloodType
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged,A
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged,B
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young,O
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged,AB
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged,A
5,6,unknown,29,2023-06-17,2023-06-27,True,False,False,False,True,False,False,Young,B
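
The merged result above has only six rows because pd.merge defaults to an inner join, which keeps only the PatientID values present in both frames. A minimal sketch on made-up frames (left and right are hypothetical names) shows how how='left' keeps every row of the first frame and how indicator=True records where each row came from:

```python
import pandas as pd

# Hypothetical toy frames, not the notebook's dataset
left = pd.DataFrame({'PatientID': [1, 2, 3], 'Name': ['a', 'b', 'c']})
right = pd.DataFrame({'PatientID': [1, 2, 4], 'BloodType': ['A', 'B', 'O']})

# Default how='inner': only IDs found in both frames survive
inner = pd.merge(left, right, on='PatientID')

# how='left' keeps all rows of `left`; unmatched rows get NaN in BloodType
left_join = pd.merge(left, right, on='PatientID', how='left', indicator=True)
print(left_join)
```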


**Vertical Concatenation (Appending Rows)**

To concatenate DataFrames vertically (i.e., to add rows), you can use the pd.concat() function with axis=0.

In [56]:
# Concatenate two DataFrames vertically
df_additional_rows = pd.DataFrame({
    'PatientID': [7, 8],
    'Name': ['Emily White', 'Michael Green'],
    'Age': [32, 41],
    'Gender': ['F', 'M'],
    'AdmissionDate': ['2023-07-01', '2023-07-05'],
    'DischargeDate': ['2023-07-10', '2023-07-15'],
    'Diagnosis': ['Flu', 'Fracture']
})
df_concat_vert = pd.concat([df, df_additional_rows], ignore_index=True)
print("\nDataFrame after vertical concatenation:")
df_concat_vert


DataFrame after vertical concatenation:


Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,Gender,Diagnosis
0,1,muhammad,34,2023-01-15 00:00:00,2023-01-25 00:00:00,False,True,False,False,False,True,False,Middle-aged,,
1,2,fatima,32,2023-02-20 00:00:00,2023-02-25 00:00:00,True,False,True,False,False,False,False,Middle-aged,,
2,3,ahmed,28,2023-03-10 00:00:00,2023-03-20 00:00:00,True,False,False,False,True,False,False,Young,,
3,4,aisha,45,2023-03-27 00:00:00,2023-04-01 00:00:00,False,True,False,False,False,True,False,Middle-aged,,
4,5,zainab,50,2023-05-05 00:00:00,2023-05-15 00:00:00,True,False,True,False,False,False,False,Middle-aged,,
5,6,unknown,29,2023-06-17 00:00:00,2023-06-27 00:00:00,True,False,False,False,True,False,False,Young,,
6,7,zara,32,2023-07-11 00:00:00,2023-07-21 00:00:00,False,True,False,False,False,True,False,Middle-aged,,
7,8,ahmed khan,25,2023-08-23 00:00:00,2023-09-02 00:00:00,True,False,False,True,False,False,False,Young,,
8,9,sara ali,32,2023-09-14 00:00:00,2023-09-24 00:00:00,False,True,True,False,False,False,False,Middle-aged,,
9,10,bilal ahmed,27,2023-10-05 00:00:00,2023-10-15 00:00:00,True,False,False,False,False,True,False,Young,,
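
In the output above, Gender and Diagnosis are empty for the existing rows because df no longer has those columns after one-hot encoding, and pd.concat takes the union of columns by default. The sketch below, on made-up frames a and b, contrasts that default with join='inner', which keeps only the columns both frames share:

```python
import pandas as pd

# Hypothetical toy frames with partially different columns
a = pd.DataFrame({'PatientID': [1, 2], 'Age': [34, 32], 'Gender_f': [False, True]})
b = pd.DataFrame({'PatientID': [7, 8], 'Age': [32, 41], 'Gender': ['F', 'M']})

# Default join='outer': union of columns, gaps are filled with NaN
outer = pd.concat([a, b], ignore_index=True)

# join='inner': only the columns present in every frame
shared = pd.concat([a, b], ignore_index=True, join='inner')

print(outer)
print(shared)
```

In practice it is usually better to apply the same preprocessing to the new rows before appending them, so that the columns already match.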


**Horizontal Concatenation (Adding Columns)**

To concatenate DataFrames horizontally (i.e., to add columns), use the pd.concat() function with axis=1.

In [57]:
# Concatenate two DataFrames horizontally
df_additional_cols = pd.DataFrame({
    'PatientID': [1, 2, 3, 4, 5, 6],
    'HospitalWing': ['East', 'West', 'North', 'South', 'East', 'West']
})
df_concat_horiz = pd.concat([df, df_additional_cols], axis=1)
print("\nDataFrame after horizontal concatenation:")
df_concat_horiz


DataFrame after horizontal concatenation:


Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,PatientID.1,HospitalWing
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged,1.0,East
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged,2.0,West
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young,3.0,North
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged,4.0,South
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged,5.0,East
5,6,unknown,29,2023-06-17,2023-06-27,True,False,False,False,True,False,False,Young,6.0,West
6,7,zara,32,2023-07-11,2023-07-21,False,True,False,False,False,True,False,Middle-aged,,
7,8,ahmed khan,25,2023-08-23,2023-09-02,True,False,False,True,False,False,False,Young,,
8,9,sara ali,32,2023-09-14,2023-09-24,False,True,True,False,False,False,False,Middle-aged,,
9,10,bilal ahmed,27,2023-10-05,2023-10-15,True,False,False,False,False,True,False,Young,,
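
The duplicated PatientID.1 column and the empty trailing cells above arise because pd.concat with axis=1 aligns rows by index label, not by PatientID. One way to avoid this, sketched here on made-up frames (patients and wings are hypothetical names), is to make the key the index of both frames first:

```python
import pandas as pd

# Hypothetical toy frames, not the notebook's dataset
patients = pd.DataFrame({'PatientID': [1, 2, 3], 'Age': [34, 32, 28]})
wings = pd.DataFrame({'PatientID': [1, 2], 'HospitalWing': ['East', 'West']})

# Index both frames by the key so concat aligns on PatientID, not row position
combined = pd.concat(
    [patients.set_index('PatientID'), wings.set_index('PatientID')],
    axis=1,
).reset_index()
print(combined)
```

When the goal is a key-based join like this, pd.merge is often the more direct tool; concat with axis=1 fits best when the frames already share an index.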


In [58]:
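# Create a new column based on existing columns
# Convert both date columns explicitly so the subtraction works even if either is still stored as text
df['StayDuration'] = (pd.to_datetime(df['DischargeDate'], errors='coerce') - pd.to_datetime(df['AdmissionDate'], errors='coerce')).dt.days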
df.head()

Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,StayDuration
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged,10
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged,5
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young,10
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged,5
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged,10
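
With errors='coerce', any value that cannot be parsed as a date becomes NaT, and the resulting duration becomes NaN instead of raising an error. A small sketch on made-up dates (the stays frame is hypothetical) shows the effect:

```python
import pandas as pd

# Hypothetical toy data; the third admission date is deliberately invalid
stays = pd.DataFrame({
    'AdmissionDate': ['2023-01-15', '2023-02-20', 'not a date'],
    'DischargeDate': ['2023-01-25', '2023-02-25', '2023-03-01'],
})

admit = pd.to_datetime(stays['AdmissionDate'], errors='coerce')
discharge = pd.to_datetime(stays['DischargeDate'], errors='coerce')

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days
stays['StayDuration'] = (discharge - admit).dt.days
print(stays)
```

Checking for NaN in the new column afterwards is a cheap way to spot rows whose dates failed to parse.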


In [59]:
# Discretize a continuous column into bins
df_copy['AgeBinned'] = pd.cut(df_copy['Age'], bins=[0, 18, 35, 50, 65, 100], labels=['Child', 'Youth', 'Adult', 'Middle-aged', 'Senior'])
print("\nDataFrame after discretizing 'Age' into bins:")
df_copy.head()


DataFrame after discretizing 'Age' into bins:


Unnamed: 0,PatientID,Name,Age,Gender,Diagnosis,Age(month),Age_normalized,AgeZScore,Diagnosis(abbr),AdmissionYear,AdmissionMonth,AdmissionDay,AdmissionTime,DischargeYear,DischargeMonth,DischargeDay,DischargeTime,DiagnosisLabel,AgeBinned
0,1,Muhammad,34,M,Flu,408,0.428571,0.182546,Flu,2023,1,15,00:00:00,2023,1,25,00:00:00,2,Youth
1,2,Fatima,32,F,Covid-19,384,0.357143,-0.095619,Cov,2023,2,20,00:00:00,2023,2,25,00:00:00,0,Youth
2,3,Ahmed,28,F,Fracture,336,0.214286,-0.651949,Fra,2023,3,10,00:00:00,2023,3,20,00:00:00,3,Youth
3,4,Aisha,45,M,Flu,540,0.821429,1.712452,Flu,2023,3,27,00:00:00,2023,4,1,00:00:00,2,Adult
4,5,Zainab,50,F,Covid-19,600,1.0,2.407864,Cov,2023,5,5,00:00:00,2023,5,15,00:00:00,0,Adult


---
### **POLYNOMIAL FEATURES**

In [60]:
# Create polynomial features from existing numerical columns
df.head()

Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,StayDuration
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged,10
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged,5
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young,10
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged,5
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged,10


In [70]:
df_p = df.copy()
df_p.head()

Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,StayDuration
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged,10
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged,5
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young,10
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged,5
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged,10


In [71]:
# Select numerical columns
numerical_cols = ['Age', 'StayDuration']

# Create a new DataFrame to store polynomial features
poly_df = df_p[numerical_cols].copy()
# Add scaled terms (each value multiplied by 12); note that these are not squared terms
for col in numerical_cols:
    poly_df[f'{col}(months)'] = df_p[col] * 12

# Add interaction terms
for i in range(len(numerical_cols)):
    for j in range(i+1, len(numerical_cols)):
        col1 = numerical_cols[i]
        col2 = numerical_cols[j]
        poly_df[f'{col1}*{col2}'] = df_p[col1] * df_p[col2]

poly_df

Unnamed: 0,Age,StayDuration,Age(months),StayDuration(months),Age*StayDuration
0,34,10,408,120,340
1,32,5,384,60,160
2,28,10,336,120,280
3,45,5,540,60,225
4,50,10,600,120,500
5,29,10,348,120,290
6,32,10,384,120,320
7,25,10,300,120,250
8,32,10,384,120,320
9,27,10,324,120,270
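
The loop above multiplies each column by 12 and does not produce degree-2 terms. For comparison, here is a minimal sketch of actual squared and interaction features on a made-up two-row frame (the column names with ^2 are an assumed naming choice):

```python
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({'Age': [34, 32], 'StayDuration': [10, 5]})
numerical_cols = ['Age', 'StayDuration']

poly = toy[numerical_cols].copy()

# Degree-2 terms: each column multiplied by itself
for col in numerical_cols:
    poly[f'{col}^2'] = toy[col] ** 2

# Interaction term: product of the two different columns
poly['Age*StayDuration'] = toy['Age'] * toy['StayDuration']
print(poly)
```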


In [72]:
# Drop the original numerical columns from df_p; poly_df already contains them, so this avoids duplicate columns after concatenation
df_p = df_p.drop(columns=numerical_cols)

# Concatenate original DataFrame with polynomial features
df_poly = pd.concat([df_p, poly_df], axis=1)
df_poly.head()


Unnamed: 0,PatientID,Name,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,Age,StayDuration,Age(months),StayDuration(months),Age*StayDuration
0,1,muhammad,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged,34,10,408,120,340
1,2,fatima,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged,32,5,384,60,160
2,3,ahmed,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young,28,10,336,120,280
3,4,aisha,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged,45,5,540,60,225
4,5,zainab,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged,50,10,600,120,500


---
#### **FINAL CLEAN DATASET**
---

In [73]:
df.head()

Unnamed: 0,PatientID,Name,Age,AdmissionDate,DischargeDate,Gender_f,Gender_m,_covid-19,_dengue,_fracture,_influenza,_malaria,AgeGroup,StayDuration
0,1,muhammad,34,2023-01-15,2023-01-25,False,True,False,False,False,True,False,Middle-aged,10
1,2,fatima,32,2023-02-20,2023-02-25,True,False,True,False,False,False,False,Middle-aged,5
2,3,ahmed,28,2023-03-10,2023-03-20,True,False,False,False,True,False,False,Young,10
3,4,aisha,45,2023-03-27,2023-04-01,False,True,False,False,False,True,False,Middle-aged,5
4,5,zainab,50,2023-05-05,2023-05-15,True,False,True,False,False,False,False,Middle-aged,10


---
### **ANOTHER DATASET**
---

### **RECIPE SITE TRAFFIC**
This dataset contains information about recipes, with the following attributes:

- recipe: Identifier or name of the recipe.
- calories: Caloric content per serving.
- carbohydrate: Amount of carbohydrates in grams per serving.
- sugar: Sugar content in grams per serving.
- protein: Protein content in grams per serving.
- category: Category or type of recipe (e.g., Pork, Potato, Breakfast, Beverages).
- servings: Number of servings the recipe yields.
- high_traffic: Indicator of high traffic or popularity; the value is "High" for high-traffic recipes and missing otherwise.


In [170]:
recipe = pd.read_csv('/kaggle/input/recipe-site-traffic/recipe_site_traffic_2212.csv')
recipe.head()

Unnamed: 0,recipe,calories,carbohydrate,sugar,protein,category,servings,high_traffic
0,1,,,,,Pork,6,High
1,2,35.48,38.56,0.66,0.92,Potato,4,High
2,3,914.28,42.68,3.09,2.88,Breakfast,1,
3,4,97.03,30.56,38.63,0.02,Beverages,4,High
4,5,27.05,1.85,0.8,0.53,Beverages,4,


In [171]:
recipe.describe()

Unnamed: 0,recipe,calories,carbohydrate,sugar,protein
count,947.0,895.0,895.0,895.0,895.0
mean,474.0,435.939196,35.069676,9.046547,24.149296
std,273.519652,453.020997,43.949032,14.679176,36.369739
min,1.0,0.14,0.03,0.01,0.0
25%,237.5,110.43,8.375,1.69,3.195
50%,474.0,288.55,21.48,4.55,10.8
75%,710.5,597.65,44.965,9.8,30.2
max,947.0,3633.16,530.42,148.75,363.36


In [172]:
recipe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 947 entries, 0 to 946
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   recipe        947 non-null    int64  
 1   calories      895 non-null    float64
 2   carbohydrate  895 non-null    float64
 3   sugar         895 non-null    float64
 4   protein       895 non-null    float64
 5   category      947 non-null    object 
 6   servings      947 non-null    object 
 7   high_traffic  574 non-null    object 
dtypes: float64(4), int64(1), object(3)
memory usage: 59.3+ KB


In [173]:
# check the number of rows and columns
recipe.shape

(947, 8)

In [174]:
#finding the number of duplicated recipes
recipe.duplicated(subset='recipe').sum()

0

In [175]:
#checking the values of servings column
recipe['servings'].value_counts()

servings
4               389
6               197
2               183
1               175
4 as a snack      2
6 as a snack      1
Name: count, dtype: int64

In [176]:
#removing the " as a snack" suffix so that only the numeric serving count remains
recipe['servings'] = recipe['servings'].str.replace(" as a snack", "")

#checking the values of servings column again
recipe['servings'].value_counts()

servings
4    391
6    198
2    183
1    175
Name: count, dtype: int64
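
Removing one known suffix works here because " as a snack" is the only non-numeric text in the column. A slightly more general sketch, assuming every entry starts with its serving count (the servings Series below is made up), extracts the leading digits with a regular expression:

```python
import pandas as pd

# Hypothetical values mirroring the kinds of entries seen above
servings = pd.Series(['4', '6', '4 as a snack', '6 as a snack', '1'])

# Capture the leading run of digits, ignoring whatever text follows
clean = servings.str.extract(r'^(\d+)', expand=False).astype(int)
print(clean.tolist())
```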

In [177]:
#converting data type of servings column to integer
recipe['servings'] = recipe['servings'].astype('int')

In [178]:
#checking the values of high_traffic column
recipe['high_traffic'].value_counts()

high_traffic
High    574
Name: count, dtype: int64

In [179]:
#replacing the value "High" with True, and null values with False
import numpy as np  # np.where needs NumPy, which is not imported at the top of the notebook
recipe['high_traffic'] = np.where(recipe['high_traffic'] == "High", True, False)

#checking the values of high_traffic column again
recipe['high_traffic'].value_counts()

high_traffic
True     574
False    373
Name: count, dtype: int64
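
The same boolean column can also be produced without NumPy, since comparing a Series to a value already returns True/False and missing entries compare as False. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical values: "High" or missing, as in the column above
traffic = pd.Series(['High', None, 'High', None])

# Element-wise comparison; missing values do not equal "High", so they become False
is_high = traffic == 'High'
print(is_high.tolist())
```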

In [180]:
#checking the values of category column
recipe['category'].value_counts()

category
Breakfast         106
Chicken Breast     98
Beverages          92
Lunch/Snacks       89
Potato             88
Pork               84
Vegetable          83
Dessert            83
Meat               79
Chicken            74
One Dish Meal      71
Name: count, dtype: int64

In [181]:
#removing " Breast" so that "Chicken Breast" is merged into the "Chicken" category
recipe['category'] = recipe['category'].str.replace(" Breast", "")

#checking the values of category column again
recipe['category'].value_counts()

category
Chicken          172
Breakfast        106
Beverages         92
Lunch/Snacks      89
Potato            88
Pork              84
Vegetable         83
Dessert           83
Meat              79
One Dish Meal     71
Name: count, dtype: int64

In [182]:
#converting data type of category column to category
recipe['category'] = recipe['category'].astype('category')

In [183]:
#checking the summary of dataframe's structure
recipe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 947 entries, 0 to 946
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   recipe        947 non-null    int64   
 1   calories      895 non-null    float64 
 2   carbohydrate  895 non-null    float64 
 3   sugar         895 non-null    float64 
 4   protein       895 non-null    float64 
 5   category      947 non-null    category
 6   servings      947 non-null    int64   
 7   high_traffic  947 non-null    bool    
dtypes: bool(1), category(1), float64(4), int64(2)
memory usage: 46.7 KB


In [184]:
# check missing values
recipe.isnull().sum()

recipe           0
calories        52
carbohydrate    52
sugar           52
protein         52
category         0
servings         0
high_traffic     0
dtype: int64

In [186]:
# drop every row that has a missing value in any column
recipe.dropna(inplace=True)

In [187]:
recipe.isnull().sum()

recipe          0
calories        0
carbohydrate    0
sugar           0
protein         0
category        0
servings        0
high_traffic    0
dtype: int64
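
Dropping rows is the simplest option, but it discards every recipe with a missing nutrient value. As an alternative sketch (not what this notebook does), missing values can be filled with the median of the recipe's own category; the toy frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data with gaps in the calories column
toy = pd.DataFrame({
    'category': ['Pork', 'Pork', 'Pork', 'Potato', 'Potato'],
    'calories': [100.0, None, 300.0, 50.0, None],
})

# transform('median') returns one median per row, aligned with the original index
category_median = toy.groupby('category')['calories'].transform('median')

# Fill only the missing entries with the median of their category
toy['calories'] = toy['calories'].fillna(category_median)
print(toy)
```

Which approach is better depends on how much data is missing and on whether the gaps are random; imputation keeps more rows but adds values that were never observed.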

In [188]:
recipe.head()

Unnamed: 0,recipe,calories,carbohydrate,sugar,protein,category,servings,high_traffic
1,2,35.48,38.56,0.66,0.92,Potato,4,True
2,3,914.28,42.68,3.09,2.88,Breakfast,1,False
3,4,97.03,30.56,38.63,0.02,Beverages,4,True
4,5,27.05,1.85,0.8,0.53,Beverages,4,False
5,6,691.15,3.46,1.65,53.93,One Dish Meal,2,True


In [189]:
# converting True/False to 1/0
recipe['high_traffic'] = recipe['high_traffic'].astype(int)

In [190]:
# Min-Max Scaling for 'calories'
recipe['calories_normalized'] = (recipe['calories'] - recipe['calories'].min()) / (recipe['calories'].max() - recipe['calories'].min())

In [191]:
# Standardization for 'protein'
mean_protein = recipe['protein'].mean()
std_protein = recipe['protein'].std()
recipe['protein_standardized'] = (recipe['protein'] - mean_protein) / std_protein
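
The two scaling formulas above can be wrapped in small reusable helpers. This is a sketch under the assumption that the column is numeric and not constant (otherwise the denominators are zero); the function names min_max and standardize are hypothetical:

```python
import pandas as pd

def min_max(s):
    """Rescale a numeric Series to the range [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())

def standardize(s, ddof=1):
    """Center a Series on 0 and divide by its standard deviation.

    ddof=1 matches the pandas default (sample standard deviation).
    """
    return (s - s.mean()) / s.std(ddof=ddof)

values = pd.Series([0.0, 5.0, 10.0])  # made-up values
scaled = min_max(values)
z = standardize(values)
print(scaled.tolist(), z.tolist())
```

Note that pandas computes the sample standard deviation (ddof=1) by default, whereas scikit-learn's StandardScaler uses the population version (ddof=0), so results can differ slightly between the two.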

In [192]:
# Convert category column to numerical using one-hot encoding:
recipe = pd.get_dummies(recipe, columns=['category'], prefix='category')
recipe.head()

Unnamed: 0,recipe,calories,carbohydrate,sugar,protein,servings,high_traffic,calories_normalized,protein_standardized,category_Beverages,category_Breakfast,category_Chicken,category_Dessert,category_Lunch/Snacks,category_Meat,category_One Dish Meal,category_Pork,category_Potato,category_Vegetable
1,2,35.48,38.56,0.66,0.92,4,1,0.009727,-0.638698,False,False,False,False,False,False,False,False,True,False
2,3,914.28,42.68,3.09,2.88,1,0,0.25162,-0.584808,False,True,False,False,False,False,False,False,False,False
3,4,97.03,30.56,38.63,0.02,4,1,0.026669,-0.663444,True,False,False,False,False,False,False,False,False,False
4,5,27.05,1.85,0.8,0.53,4,0,0.007407,-0.649422,True,False,False,False,False,False,False,False,False,False
5,6,691.15,3.46,1.65,53.93,2,1,0.190203,0.818832,False,False,False,False,False,False,True,False,False,False
