# CodSoft | Data Science | Internship | Task No # 03

---
## Problem Statement | Task 3 | IRIS Flower Classification

   - The Iris flower dataset consists of three species: setosa, versicolor, and virginica. These species can be distinguished based on their measurements. Now, imagine that you have the measurements of Iris flowers categorized by their respective species. Your objective is to train a machine learning model that can learn from these measurements and accurately classify the Iris flowers into their respective species.

   - Use the Iris dataset to develop a model that can classify iris flowers into different species based on their sepal and petal measurements. This dataset is widely used for introductory classification tasks.

### **Import necessary libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### **Load the IRIS dataset**

In [None]:
di = pd.read_csv("IRIS.csv")

In [None]:
di.head()

In [None]:
di.tail()

In [None]:
di.shape

In [None]:
di.info()

In [None]:
di.describe()
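
If `IRIS.csv` is not available locally, the same measurements can be taken from the copy bundled with scikit-learn. The sketch below is only an approximation of the CSV: the column names are chosen to match the ones used in this notebook, `di_alt` is a hypothetical name, and the species strings in the CSV may be spelled differently.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load scikit-learn's bundled Iris data as a DataFrame
iris = load_iris(as_frame=True)
di_alt = iris.frame.copy()

# Rename the columns to the names used in this notebook
# (assumed to match the layout of IRIS.csv)
di_alt.columns = ['sepal_length', 'sepal_width', 'petal_length',
                  'petal_width', 'species']

# Replace the integer targets (0, 1, 2) with the species names
di_alt['species'] = di_alt['species'].map(dict(enumerate(iris.target_names)))

print(di_alt.shape)
print(di_alt['species'].value_counts())
```

Assigning `di = di_alt` would let the remaining cells run on this copy instead of the CSV.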

### **Data Exploration and Analysis**
 - **Let's start with some data exploration and visualization.**

### **Visualize the relationship between species and sepal/petal measurements**

In [None]:
# Visualize the relationship between species 
# and sepal/petal measurements

sns.pairplot(di, hue='species')

plt.show()

### **Counts of Species**

In [None]:

plt.figure(figsize=(8, 6))
sns.countplot(data = di, x='species', palette='Set2')
plt.title('Counts of Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()

### **Counts of sepal_length**

In [None]:

plt.figure(figsize=(8, 6))
sns.histplot(data= di, x='sepal_length', bins=30, kde=True)
plt.title('Counts of sepal_length')
plt.xlabel('sepal_length')
plt.ylabel('Count')
plt.show()

### **Count of sepal_width**

In [None]:

plt.figure(figsize=(8, 6))
sns.histplot(data=di, x='sepal_width', bins=30, kde=True)
plt.title('Counts of sepal_width')
plt.xlabel('sepal_width')
plt.ylabel('Count')
plt.show()

### **Count of Petal Length**

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(data= di, x='petal_length', bins=30, kde=True)
plt.title('Counts of Petal Length')
plt.xlabel('Petal Length')
plt.ylabel('Count')
plt.show()

### **Count of Petal Width**

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(data= di, x='petal_width', bins=30, kde=True)
plt.title('Counts of Petal Width')
plt.xlabel('Petal Width')
plt.ylabel('Count')
plt.show()

### **Scatter Plot**

In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(x="sepal_length", y="sepal_width", hue="species", data=di)
plt.show()

### **Line Plot**

In [None]:
plt.figure(figsize=(8,8))
sns.lineplot(x="petal_length", y="petal_width", hue="species", style="species", data=di)
plt.show()

### **Box Plot**


In [None]:
plt.figure(figsize=(8,8))
# Group the box plot by species; a continuous x-axis would create one box per distinct value
sns.boxplot(x="species", y="sepal_width", data=di)
plt.show()

### **Now let's plot the correlation matrix of our data with a heatmap.**

In [None]:
plt.subplots(figsize=(10, 10))
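# Correlation is only defined for numeric columns, so drop the 'species' label first
sns.heatmap(di.drop(columns='species').corr(), cmap = "YlGnBu", annot=True, fmt=".2f")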
plt.show()

### **Missing Values**

In [None]:
# Heatmap
sns.heatmap(di.isnull(),yticklabels = False, cbar = False,cmap = 'tab20c_r')
plt.title('Missing Data: Iris Dataset')
plt.show()

In [None]:
di.isnull().sum()

### **Data Preprocessing**
 - **Handle missing values**
 - **Encode the categorical target**
 - **Scale the numeric features**

In [None]:
# Work on a copy so the raw data in `di` stays untouched
fd = di.copy()

# The Iris data has no identifier or free-text columns to remove,
# so all four measurements and the species label are kept

# Remove rows with missing data
fd.dropna(inplace = True)

In [None]:
fd.head()

In [None]:
fd.shape

### **Fill missing values in the measurement columns with appropriate values**

In [None]:
# Fill any remaining missing measurements with the column mean
# (a safeguard only: rows with missing data were already dropped above)
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
fd[feature_cols] = fd[feature_cols].fillna(fd[feature_cols].mean())

In [None]:
fd.head()

### **Encode the categorical target 'species'**

In [None]:
# Encode the species names as integer class labels (classes are sorted alphabetically)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
fd['species'] = label_encoder.fit_transform(fd['species'])

In [None]:
fd.head()

### **Identify categorical columns for one-hot encoding**

In [None]:
# Identify categorical columns for one-hot encoding
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


categorical_cols = []  # Iris has no categorical feature columns; 'species' is the target

### **Identify numeric columns for scaling**

In [None]:
# Identify numeric columns for scaling
numeric_cols = [col for col in fd.columns if col not in categorical_cols and col != 'species']

### **Create transformers for preprocessing**

In [None]:
# Create transformers for preprocessing
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(drop='first'))])

### **Use ColumnTransformer to apply transformers to the appropriate columns**

In [None]:
# Use ColumnTransformer to apply transformers to the appropriate columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)])

### **Define the model**

In [None]:
# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(random_state=42))])

In [None]:
fd.head()

### **Split the data into training and testing sets**

In [None]:
# Split the data into training and testing sets
X = fd.drop(['species'], axis=1)
y = fd['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
print("Shape of X_train:" ,X_train.shape)
print("Shape of y_train:" ,y_train.shape)
print("Shape of X_test:" ,X_test.shape)
print("Shape of y_test:" ,y_test.shape)
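
With only 150 rows, a purely random split can leave the three species unevenly represented in the test set. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below shows its effect on scikit-learn's bundled copy of the data; the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the Iris data: 150 rows, 50 per species
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# stratify=y_demo keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Number of test rows per class
print(y_te.value_counts().sort_index())
```

In this notebook the equivalent change would be to pass `stratify=y` to the existing `train_test_split` call.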

### **Train a Logistic Regression Model**

In [None]:
# Train the Logistic Regression pipeline defined above
# (fitting the pipeline applies the preprocessing before the classifier)
model.fit(X_train, y_train)

In [None]:
# Make Predictions
y_pred = model.predict(X_test)

In [None]:
print(y_pred)
print(y_pred.shape)

### **Evaluate the Model**

In [None]:

print('Classification Model')
# Accuracy
print('--'*30)
logreg_accuracy = round(accuracy_score(y_test, y_pred) * 100,2)
print('Accuracy', logreg_accuracy,'%')

In [None]:
# Confusion matrix: rows are actual classes, columns are predicted classes
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", confusion)


In [None]:
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)
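
A single 30-row test set gives a fairly noisy accuracy estimate. As a complementary check, the sketch below runs 5-fold stratified cross-validation on scikit-learn's bundled copy of the data, with a scaler and logistic regression in one pipeline. The names `X_cv`, `y_cv`, `cv_model`, and `cv_scores` are hypothetical, and the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Iris data as NumPy arrays
X_cv, y_cv = load_iris(return_X_y=True)

# Scaling sits inside the pipeline, so it is re-fitted on each training fold
cv_model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42)),
])

# Stratified folds keep the three species balanced in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_model, X_cv, y_cv, cv=cv, scoring='accuracy')

print('Fold accuracies:', cv_scores.round(3))
print('Mean accuracy:', round(cv_scores.mean() * 100, 2), '%')
```

Keeping the scaler inside the pipeline means the held-out fold never influences the scaling statistics.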

In [None]:
# Plot the confusion matrix: correct predictions lie on the diagonal
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted class')
plt.ylabel('Actual class')
plt.show()

In [None]:
# Combine actual and predicted class labels side by side
results = np.column_stack((y_test, y_pred))

# Printing the results (labels are classes, so print them as-is rather than as floats)
print("Actual Class   |  Predicted Class")
print("-----------------------------")
for actual, predicted in results:
    print(f"{str(actual):>14} |  {str(predicted):>12}")