# AOML_CODE_16 - Regularization Techniques
In this notebook, we aim to predict the salinity of each sample from the other features. Each step is explained as we go.

### Stage 1: Data Cleaning
This stage focuses on handling missing values and outliers, which are common issues that can affect the quality of the data.

#### Substep 1: Removing Columns with all NULL values
Objective: To eliminate columns that do not contribute to the analysis due to lack of usable data.

Action: Iterate through each column in the dataset and check if all values are null. If so, remove the entire column.

In [1]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('bottle.csv')

# View the first 5 rows
print(df.head())


   Cst_Cnt  Btl_Cnt       Sta_ID                                Depth_ID  \
0        1        1  054.0 056.0  19-4903CR-HY-060-0930-05400560-0000A-3   
1        1        2  054.0 056.0  19-4903CR-HY-060-0930-05400560-0008A-3   
2        1        3  054.0 056.0  19-4903CR-HY-060-0930-05400560-0010A-7   
3        1        4  054.0 056.0  19-4903CR-HY-060-0930-05400560-0019A-3   
4        1        5  054.0 056.0  19-4903CR-HY-060-0930-05400560-0020A-7   

   Depthm  T_degC  Salnty  O2ml_L  STheta  O2Sat  ...  R_PHAEO  R_PRES  \
0       0   10.50  33.440     NaN  25.649    NaN  ...      NaN       0   
1       8   10.46  33.440     NaN  25.656    NaN  ...      NaN       8   
2      10   10.46  33.437     NaN  25.654    NaN  ...      NaN      10   
3      19   10.45  33.420     NaN  25.643    NaN  ...      NaN      19   
4      20   10.45  33.421     NaN  25.643    NaN  ...      NaN      20   

   R_SAMP  DIC1  DIC2  TA1  TA2  pH2  pH1  DIC Quality Comment  
0     NaN   NaN   NaN  NaN  NaN  

In [4]:
# Remove columns where all the values are null 
df = df.dropna(axis=1, how='all')
df.head()

Unnamed: 0,Cst_Cnt,Btl_Cnt,Sta_ID,Depth_ID,Depthm,T_degC,Salnty,O2ml_L,STheta,O2Sat,...,R_PHAEO,R_PRES,R_SAMP,DIC1,DIC2,TA1,TA2,pH2,pH1,DIC Quality Comment
0,1,1,054.0 056.0,19-4903CR-HY-060-0930-05400560-0000A-3,0,10.5,33.44,,25.649,,...,,0,,,,,,,,
1,1,2,054.0 056.0,19-4903CR-HY-060-0930-05400560-0008A-3,8,10.46,33.44,,25.656,,...,,8,,,,,,,,
2,1,3,054.0 056.0,19-4903CR-HY-060-0930-05400560-0010A-7,10,10.46,33.437,,25.654,,...,,10,,,,,,,,
3,1,4,054.0 056.0,19-4903CR-HY-060-0930-05400560-0019A-3,19,10.45,33.42,,25.643,,...,,19,,,,,,,,
4,1,5,054.0 056.0,19-4903CR-HY-060-0930-05400560-0020A-7,20,10.45,33.421,,25.643,,...,,20,,,,,,,,
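The action described for this substep (go through the columns and test whether every value is null) can also be written out explicitly and compared with `dropna(axis=1, how='all')`. The sketch below uses a small made-up frame rather than `bottle.csv`; the `empty` column and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): the 'empty' column holds only nulls
toy = pd.DataFrame({
    "Depthm": [0, 8, 10],
    "Salnty": [33.44, np.nan, 33.437],
    "empty": [np.nan, np.nan, np.nan],
})

# Explicit version: collect the columns whose values are all null, then drop them
all_null_cols = [col for col in toy.columns if toy[col].isna().all()]
manual = toy.drop(columns=all_null_cols)

# One-call version, as used in the notebook cell
auto = toy.dropna(axis=1, how="all")

print(all_null_cols)
print(manual.equals(auto))
```

Note that `Salnty` is kept in both versions: a column is removed only when all of its values are null, not when some are.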


#### Substep 2: Handling Missing Values
Objective: To decide how to deal with missing values (e.g., imputation, deletion).

Action:
- Deletion: Remove rows with missing values (appropriate when such rows are too few to significantly affect the dataset).

In [7]:
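# Drop every row that contains at least one missing value
df_deleted_simple = df.dropna()
# Preview the original frame for comparison; the cleaned rows are in df_deleted_simple
df.head()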

Unnamed: 0,Cst_Cnt,Btl_Cnt,Sta_ID,Depth_ID,Depthm,T_degC,Salnty,O2ml_L,STheta,O2Sat,...,R_PHAEO,R_PRES,R_SAMP,DIC1,DIC2,TA1,TA2,pH2,pH1,DIC Quality Comment
0,1,1,054.0 056.0,19-4903CR-HY-060-0930-05400560-0000A-3,0,10.5,33.44,,25.649,,...,,0,,,,,,,,
1,1,2,054.0 056.0,19-4903CR-HY-060-0930-05400560-0008A-3,8,10.46,33.44,,25.656,,...,,8,,,,,,,,
2,1,3,054.0 056.0,19-4903CR-HY-060-0930-05400560-0010A-7,10,10.46,33.437,,25.654,,...,,10,,,,,,,,
3,1,4,054.0 056.0,19-4903CR-HY-060-0930-05400560-0019A-3,19,10.45,33.42,,25.643,,...,,19,,,,,,,,
4,1,5,054.0 056.0,19-4903CR-HY-060-0930-05400560-0020A-7,20,10.45,33.421,,25.643,,...,,20,,,,,,,,
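Before choosing between deletion and imputation, it can help to measure how much data each option would discard. The following is a minimal sketch on a made-up frame; the values, the choice of columns in `subset`, and the median imputation are assumptions for illustration, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "T_degC": [10.50, 10.46, np.nan, 10.45],
    "Salnty": [33.440, 33.440, 33.437, np.nan],
    "O2ml_L": [np.nan, 5.9, 6.0, 6.1],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Option 1: deletion, drop any row with at least one missing value
deleted = toy.dropna()

# Option 2: deletion restricted to the columns the model actually needs
deleted_subset = toy.dropna(subset=["T_degC", "Salnty"])

# Option 3: imputation, fill each column with its own median
imputed = toy.fillna(toy.median(numeric_only=True))

print(missing_share)
print(len(toy), len(deleted), len(deleted_subset), len(imputed))
```

When missing values are spread across many columns, plain `dropna()` can discard most of the rows, so comparing the row counts above is a quick sanity check before committing to deletion.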


### Stage 2: Encoding Categorical Variables
Objective: To convert categorical variables into numerical form suitable for machine learning models.

Action:

- One-Hot Encoding: Create binary columns for each category in a feature.
- Label Encoding: Assign a unique integer to each category.
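The two encodings listed above can be sketched with pandas alone. The `quality` column and its labels are hypothetical and serve only to show the shape of the result.

```python
import pandas as pd

# Hypothetical categorical feature; the labels are illustrative only
toy = pd.DataFrame({"quality": ["good", "bad", "good", "unknown"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(toy["quality"], prefix="quality", dtype=int)

# Label encoding: one integer per category (string categories are sorted alphabetically)
toy["quality_code"] = toy["quality"].astype("category").cat.codes

print(one_hot)
print(toy)
```

Label encoding keeps a single column but imposes an integer order that a linear model may read as meaningful. One-hot encoding avoids that at the cost of one extra column per category.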