In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
path = "/content/drive/MyDrive/Datasets for CMPE 255/species.csv"
df = pd.read_csv(path)

**Data Preparation**:
  
  The Data Preparation phase is where the raw data is transformed and readied for modeling. It often takes up a large share of a data science project, and it deserves that effort because data quality directly affects model performance.

For this dataset, the Data Preparation phase can be detailed as follows:

1. **Data Cleaning**:

  **Handling Missing Values**:
  The Iris dataset is generally clean, but real-world datasets often have missing values. Strategies to handle them include imputation (filling each missing value with an estimate such as the column mean or median) or deletion (removing records with missing values).

  **Removing Duplicates**:
  Check for and remove duplicate rows, which can bias the model by giving repeated observations extra weight.
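
As a minimal sketch of these two cleaning steps, the snippet below uses a small made-up frame (`toy`) with the same column names as `df`. The values and the `Iris-setosa` label are illustrative assumptions, not rows from `species.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 3 duplicates row 0, and row 2 has a missing value.
toy = pd.DataFrame({
    "sepal length": [5.1, 4.9, np.nan, 5.1],
    "sepal width": [3.5, 3.0, 3.2, 3.5],
    "petal length": [1.4, 1.4, 1.3, 1.4],
    "petal width": [0.2, 0.2, 0.2, 0.2],
    "class": ["Iris-setosa"] * 4,
})

# Remove exact duplicate rows, keeping the first occurrence.
deduped = toy.drop_duplicates()

# Option A (deletion): drop any row that still has a missing value.
dropped = deduped.dropna()

# Option B (imputation): fill missing numeric values with the column median.
numeric_cols = deduped.select_dtypes(include="number").columns
imputed = deduped.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(len(toy), len(deduped), len(dropped))
print(imputed)
```

In a real workflow the median would be computed from the training rows only, for the reason given under Data Leakage.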

2. **Data Transformation**:
  
  **Feature Engineering**:
Based on the insights from the data understanding phase, new features might be derived or existing ones might be transformed to better capture the underlying patterns. For instance, the ratio of petal length to width might be a useful feature.

  **Normalization/Standardization**:
Some machine learning algorithms, like SVM or k-NN, are sensitive to the scale of features. Features might need to be normalized (scaled to [0,1]) or standardized (scaled to have mean 0 and variance 1).
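
The transformations above could look roughly like the following. The frame `toy` and the new column name `petal ratio` are assumptions made for illustration; only the `petal length` and `petal width` column names come from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical measurements, using the dataset's column names.
toy = pd.DataFrame({
    "petal length": [1.4, 4.7, 6.0, 5.1],
    "petal width": [0.2, 1.4, 2.5, 1.9],
})

# Feature engineering: ratio of petal length to petal width.
toy["petal ratio"] = toy["petal length"] / toy["petal width"]

cols = ["petal length", "petal width"]

# Normalization: rescale each column to the range [0, 1].
normalized = MinMaxScaler().fit_transform(toy[cols])

# Standardization: rescale each column to mean 0 and variance 1.
standardized = StandardScaler().fit_transform(toy[cols])

print(toy)
print(normalized.min(axis=0), normalized.max(axis=0))
print(standardized.mean(axis=0), standardized.std(axis=0))
```

Both scalers return NumPy arrays. `StandardScaler` uses the population standard deviation, so `standardized.std(axis=0)` is 1 for each column up to floating-point error.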

3. **Data Reduction**:
  
  **Feature Selection**:
If there were many features, redundant or irrelevant features might need to be removed. This isn't typically necessary for the Iris dataset due to its simplicity, but in larger datasets, techniques like Recursive Feature Elimination or feature importance from tree-based algorithms might be employed.
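
For illustration only, here is a sketch of both techniques on scikit-learn's bundled copy of Iris, used as a stand-in for `species.csv`. The choice of logistic regression as the RFE estimator and of keeping two features is an arbitrary assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Bundled Iris data: 150 rows, 4 numeric features, integer class labels.
X, y = load_iris(return_X_y=True)

# Recursive Feature Elimination: repeatedly fit the estimator and drop the
# weakest feature until only n_features_to_select remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)
print("Selected mask:", selector.support_)
print("Ranking (1 = kept):", selector.ranking_)

# Feature importance from a tree-based model; the values sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print("Importances:", forest.feature_importances_)
```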

4. **Data Splitting**:
  
  **Training and Testing Sets**:
The dataset should be split into a training set to train the model and a testing set to evaluate its performance. A common split ratio is 80% for training and 20% for testing.

  **Stratified Split**:
Given that the Iris dataset has three equally represented classes, it's essential to ensure that the train-test split maintains this class distribution. Stratified sampling ensures that each split has the same proportion of classes as the whole dataset.
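
A small sketch of the difference, using scikit-learn's bundled copy of Iris (integer labels 0, 1, 2) as a stand-in for `species.csv`. Passing `stratify=y` to `train_test_split` is what requests the stratified split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled Iris data: 50 samples for each of the three classes.
X, y = load_iris(return_X_y=True)

# Plain random split: class counts in the test set are left to chance.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified split: each subset keeps the class proportions of the full data.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_test_counts = np.bincount(y_test_strat, minlength=3)
strat_train_counts = np.bincount(y_train_strat, minlength=3)

print("Test counts without stratify:", plain_counts)
print("Test counts with stratify:   ", strat_test_counts)
print("Train counts with stratify:  ", strat_train_counts)
```

With 150 balanced samples and a 20% test set, the stratified test set holds 10 samples per class and the training set 40 per class.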
  
5. **Data Encoding**:
  
  **Categorical Variable Encoding**:
Most machine learning models require numerical input, so categorical variables (like the species in the Iris dataset) need to be encoded. One-hot encoding is the usual choice for categorical features, while label encoding suits a categorical target. Since species is the target variable, it would be encoded as integer labels. scikit-learn classifiers also accept string class labels directly, but many other libraries expect integers.
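
A minimal label-encoding sketch follows. The species strings are assumed for illustration; the actual values in the `class` column are not shown in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels standing in for df['class'].
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# LabelEncoder maps each distinct label to an integer, in sorted order.
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# inverse_transform recovers the original strings, e.g. for reporting.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
print(list(decoded))
```

Keeping the fitted `encoder` around makes it possible to translate model predictions back into species names.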

6. **Challenges and Considerations**:
  
  **Overfitting**:
If the model is given too many features, or if preprocessing choices are tuned too closely to the training data, the model might perform exceptionally well on the training data but poorly on new, unseen data. This phenomenon is called overfitting.
  
  **Data Leakage**:
Ensure that any transformation or imputation is learned only from the training data and then applied to the testing data. Using information from the test set during the data preparation phase can lead to overly optimistic performance estimates.
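
One common way to avoid leakage is to wrap preprocessing and model in a scikit-learn pipeline, so the scaler is fitted on the training rows only. The sketch below assumes scikit-learn's bundled Iris data as a stand-in for `species.csv` and picks k-NN with `n_neighbors=5` arbitrarily.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# fit() learns the scaling statistics from X_train only; score() and
# predict() reuse those statistics on the test rows without refitting.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

A training accuracy far above the test accuracy would be a sign of the overfitting described above.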

In summary, the Data Preparation phase ensures that the dataset is in the best possible format and quality for modeling. Properly prepared data can lead to better model performance and more accurate insights.

In [3]:
df.isnull().sum()

sepal length    0
sepal width     0
petal length    0
petal width     0
class           0
dtype: int64

Splitting the dataset into training and testing sets is a crucial step in the data preparation process. By doing so, you can train the model on one subset and test it on another, unseen subset, to evaluate its performance.



In [4]:
from sklearn.model_selection import train_test_split

# Separate the features and the target variable
X = df.drop('class', axis=1)  # Features (sepal and petal measurements)
y = df['class']  # Target variable (species)

# Split the dataset: 80% for training and 20% for testing.
# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


With this code:

`X_train` and `y_train` are the features and target variable for the training set, respectively.

`X_test` and `y_test` are the features and target variable for the testing set, respectively.

`test_size=0.2` indicates that 20% of the data will be used for testing.

`random_state=42` ensures reproducibility. Any integer value can be used; fixing it means the data is split in the same way every time the code is run.

After splitting, you can proceed with training models using the training set (`X_train` and `y_train`) and then evaluate their performance using the testing set (`X_test` and `y_test`).


