
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [9]:
df = pd.read_csv('/content/Algerian_forest_fires_dataset.csv')

In [10]:
df.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247 entries, 0 to 246
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          246 non-null    object
 1   month        245 non-null    object
 2   year         245 non-null    object
 3   Temperature  245 non-null    object
 4    RH          245 non-null    object
 5    Ws          245 non-null    object
 6   Rain         245 non-null    object
 7   FFMC         245 non-null    object
 8   DMC          245 non-null    object
 9   DC           245 non-null    object
 10  ISI          245 non-null    object
 11  BUI          245 non-null    object
 12  FWI          245 non-null    object
 13  Classes      244 non-null    object
dtypes: object(14)
memory usage: 27.1+ KB


# Data cleaning

In [15]:
df.isnull().sum()

Unnamed: 0,0
day,1
month,2
year,2
Temperature,2
RH,2
Ws,2
Rain,2
FFMC,2
DMC,2
DC,2


In [32]:
df['Classes'] = df['Classes'].astype(int)

# Task
Explain the error in the selected code. If possible, fix the error and incorporate the changes into the existing code. Otherwise, try to diagnose the error.

## Identify and handle non-numeric values and missing values

### Subtask:
Inspect the 'Classes' column for its unique string values and replace them with numeric codes (0 for 'not fire', 1 for 'fire'). Strip leading and trailing spaces from the column names and from the values. Handle missing values in all relevant columns by dropping rows or imputing values, depending on the data characteristics.


**Reasoning**:
The error indicates that the column 'Classes' cannot be found under that exact name, most likely because the column names contain leading or trailing spaces. I will inspect the column names and strip the whitespace, then check the unique values in the 'Classes' column, strip them as well, and replace the string values with numeric codes. Finally, I will handle missing values by dropping rows.
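The two failure modes discussed above can be reproduced on a small toy frame. This is only a sketch: the frame `toy`, its values, and the padded header `'Classes  '` are assumptions chosen to resemble the raw data, not rows taken from the CSV.

```python
import pandas as pd

# Toy frame (hypothetical data) with padding similar to what the raw file seems to have.
toy = pd.DataFrame({
    "Temperature": ["29", "31"],
    "Classes  ": ["not fire   ", "fire "],
})

# An exact-match lookup fails while the header still carries trailing spaces.
try:
    toy["Classes"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Stripping the header makes the plain name usable.
toy.columns = toy.columns.str.strip()

# The labels are text, so a direct integer cast fails as well.
try:
    toy["Classes"].astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Strip the values, then map each label to its numeric code.
toy["Classes"] = toy["Classes"].str.strip().map({"not fire": 0, "fire": 1})
print(lookup_failed, cast_failed, toy["Classes"].tolist())
```

Both the header and the values need stripping before the labels can be encoded, which is the order the next cell follows.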



In [20]:
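# Strip stray whitespace from the column names.
df.columns = df.columns.str.strip()
print(df.columns)
print(df['Classes'].unique())

# Strip stray whitespace from the class labels.
df['Classes'] = df['Classes'].str.strip()
print(df['Classes'].unique())

# Encode the labels: 'not fire' -> 0, 'fire' -> 1.
df['Classes'] = df['Classes'].replace(['not fire', 'fire'], [0, 1])
print(df['Classes'].unique())

# Drop every row that has a missing value.
df.dropna(inplace=True)
print(df.isnull().sum())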

Index(['day', 'month', 'year', 'Temperature', 'RH', 'Ws', 'Rain', 'FFMC',
       'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes'],
      dtype='object')
['not fire   ' 'fire   ' 'fire' 'fire ' 'not fire' 'not fire ' nan
 'Classes  ' 'not fire     ' 'not fire    ']
['not fire' 'fire' nan 'Classes']
[0 1 nan 'Classes']
day            0
month          0
year           0
Temperature    0
RH             0
Ws             0
Rain           0
FFMC           0
DMC            0
DC             0
ISI            0
BUI            0
FWI            0
Classes        0
dtype: int64


**Reasoning**:
After stripping spaces and replacing 'fire' and 'not fire' with 1 and 0, the 'Classes' column still contains the string 'Classes' (apparently a header row repeated inside the file) and NaN values. These rows should be removed because they do not represent valid class labels.



In [21]:
df = df[df['Classes'] != 'Classes']
df = df.dropna()
print(df['Classes'].unique())

[0 1]
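As an alternative to filtering out the stray 'Classes' string explicitly, the labels could be encoded with `Series.map`, which returns NaN for any value that is not a key in the mapping. The sketch below uses a hypothetical `labels` series, not the notebook's `df`, and swaps `replace` for `map`.

```python
import pandas as pd

# Hypothetical label column with a repeated header entry and a missing value.
labels = pd.Series(["not fire   ", "fire ", "Classes  ", None, "fire"])

# map() yields NaN for anything that is not a key in the dictionary,
# so the stray "Classes" entry and the missing value are treated alike.
codes = labels.str.strip().map({"not fire": 0, "fire": 1})

# A single dropna() then removes both kinds of invalid entry.
clean = codes.dropna().astype(int)
print(clean.tolist())
```

With this variant, one `dropna()` call covers both the repeated header and the genuinely missing labels.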


## Convert columns to appropriate data types

### Subtask:
Convert the cleaned 'Classes' column and the other relevant columns (such as 'Temperature', 'RH', 'Ws', 'Rain', 'FFMC', 'DMC', 'DC', 'ISI', 'BUI', and 'FWI') to numeric types to enable numerical analysis.


**Reasoning**:
Convert the identified columns to numeric types and handle any resulting NaNs by dropping the rows.
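A small illustration of what `errors='coerce'` does, on a hypothetical series named `raw` rather than the real data: values that cannot be parsed become NaN, and those rows can then be dropped.

```python
import pandas as pd

# Hypothetical column mixing valid numbers with a stray header string.
raw = pd.Series(["29", "26", "Temperature", "31"])

# errors='coerce' converts what it can and replaces the rest with NaN.
converted = pd.to_numeric(raw, errors="coerce")

# NaN forces a float dtype; once it is dropped the values can be cast to int.
cleaned = converted.dropna().astype(int)
print(int(converted.isna().sum()), cleaned.tolist())
```

Counting the NaN values right after the conversion shows how many rows the following `dropna` will discard.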



In [22]:
cols_to_convert = ['day', 'month', 'year', 'Temperature', 'RH', 'Ws', 'Rain', 'FFMC', 'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes']

for col in cols_to_convert:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df.dropna(inplace=True)
print(df.isnull().sum())
print(df.info())

day            0
month          0
year           0
Temperature    0
RH             0
Ws             0
Rain           0
FFMC           0
DMC            0
DC             0
ISI            0
BUI            0
FWI            0
Classes        0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Index: 243 entries, 0 to 246
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          243 non-null    int64  
 1   month        243 non-null    int64  
 2   year         243 non-null    int64  
 3   Temperature  243 non-null    int64  
 4   RH           243 non-null    int64  
 5   Ws           243 non-null    int64  
 6   Rain         243 non-null    float64
 7   FFMC         243 non-null    float64
 8   DMC          243 non-null    float64
 9   DC           243 non-null    float64
 10  ISI          243 non-null    float64
 11  BUI          243 non-null    float64
 12  FWI          243 non-null    float64
 13  Classes      243 non-null    int64  

## Verify data types and missing values

### Subtask:
After cleaning and conversion, verify that all columns have the correct data types and that missing values have been handled effectively.


**Reasoning**:
Verify the data types and check for missing values after cleaning and conversion.



In [23]:
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Index: 243 entries, 0 to 246
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          243 non-null    int64  
 1   month        243 non-null    int64  
 2   year         243 non-null    int64  
 3   Temperature  243 non-null    int64  
 4   RH           243 non-null    int64  
 5   Ws           243 non-null    int64  
 6   Rain         243 non-null    float64
 7   FFMC         243 non-null    float64
 8   DMC          243 non-null    float64
 9   DC           243 non-null    float64
 10  ISI          243 non-null    float64
 11  BUI          243 non-null    float64
 12  FWI          243 non-null    float64
 13  Classes      243 non-null    int64  
dtypes: float64(7), int64(7)
memory usage: 28.5 KB


Unnamed: 0,0
day,0
month,0
year,0
Temperature,0
RH,0
Ws,0
Rain,0
FFMC,0
DMC,0
DC,0


## Proceed with further analysis

### Subtask:
Proceed with further analysis


In [24]:
df.isnull().sum()

Unnamed: 0,0
day,0
month,0
year,0
Temperature,0
RH,0
Ws,0
Rain,0
FFMC,0
DMC,0
DC,0


In [31]:
df.columns = df.columns.str.strip()


In [33]:
df[['month','day','year','Temperature','RH']] = df[['month','day','year','Temperature','RH']].astype(int)

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 243 entries, 0 to 246
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          243 non-null    int64  
 1   month        243 non-null    int64  
 2   year         243 non-null    int64  
 3   Temperature  243 non-null    int64  
 4   RH           243 non-null    int64  
 5   Ws           243 non-null    int64  
 6   Rain         243 non-null    float64
 7   FFMC         243 non-null    float64
 8   DMC          243 non-null    float64
 9   DC           243 non-null    float64
 10  ISI          243 non-null    float64
 11  BUI          243 non-null    float64
 12  FWI          243 non-null    float64
 13  Classes      243 non-null    int64  
dtypes: float64(7), int64(7)
memory usage: 28.5 KB


In [40]:
# Collect the columns that still have the object dtype (pandas code 'O', capital letter).
columns = [feature for feature in df.columns if df[feature].dtype == 'O']
print(columns)
# The list is empty because every column in df has already been converted to a numeric dtype (int64 or float64).

[]
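The same check can be written with `DataFrame.select_dtypes`, which avoids comparing dtype codes by hand. The frame below is a hypothetical example, not the notebook's `df`.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
toy = pd.DataFrame({"Temperature": [29, 31], "Region": ["A", "B"]})

# exclude="number" keeps every column that is not numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Applied to a fully converted frame such as the one above, the result would be an empty list.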


In [41]:
df.describe()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
count,243.0,243.0,243.0,243.0,243.0,243.0,243.0,243.0,243.0,243.0,243.0,243.0,243.0,243.0
mean,15.761317,7.502058,2012.0,32.152263,62.041152,15.493827,0.762963,77.842387,14.680658,49.430864,4.742387,16.690535,7.035391,0.563786
std,8.842552,1.114793,0.0,3.628039,14.82816,2.811385,2.003207,14.349641,12.39304,47.665606,4.154234,14.228421,7.440568,0.496938
min,1.0,6.0,2012.0,22.0,21.0,6.0,0.0,28.6,0.7,6.9,0.0,1.1,0.0,0.0
25%,8.0,7.0,2012.0,30.0,52.5,14.0,0.0,71.85,5.8,12.35,1.4,6.0,0.7,0.0
50%,16.0,8.0,2012.0,32.0,63.0,15.0,0.0,83.3,11.3,33.1,3.5,12.4,4.2,1.0
75%,23.0,8.0,2012.0,35.0,73.5,17.0,0.5,88.3,20.8,69.1,7.25,22.65,11.45,1.0
max,31.0,9.0,2012.0,42.0,90.0,29.0,16.8,96.0,65.9,220.4,19.0,68.0,31.1,1.0
