# **Assignment NO-02 (Pandas and Machine Learning)**

## **Questions**:

1. Create a DataFrame with random data and select rows where column 'A' value is greater than 50.

2. Add a new column to the DataFrame based on the condition of another column and fill missing values in that column with the mean of the column.

3. Load a CSV file, group by a column, and calculate the mean of another column.

4. Split the Iris dataset into training and test sets (80% train, 20% test), and train a Support Vector Machine (SVM) classifier. Display the test accuracy.

5. Train a Random Forest classifier using the LFW (Labeled Faces in the Wild) dataset of face images with a train-test split. Display the test accuracy.

## **Solutions**:

1. Create a DataFrame with random data and select rows where column 'A' value is greater than 50.

**Solution 1**:

In [3]:
import pandas as pd
import numpy as np

# Create a DataFrame with random integers from 1 to 99 (the upper bound 100 is exclusive)
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 3)), columns=['Math', 'English', 'Physics'])

# Select rows where the value in column 'Math' (standing in for column 'A') is greater than 50
filtered_df = df[df['Math'] > 50]

# Display the filtered DataFrame
print(filtered_df)


   Math  English  Physics
1    96       75       63
2    56       78       81
3    91        8       11
6    79       96       99
7    52       12       37
8    81       31       61
9    63        5       90
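
Because the data above is unseeded, the rows change on every run. The following is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator with an arbitrary seed of 0; the names `scores`, `above_50`, and `above_50_query` are illustrative and not part of the original solution. It also shows `query()` as an equivalent way to express the same filter.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same rows are produced on every run (seed 0 is arbitrary)
rng = np.random.default_rng(0)

# integers() excludes the upper bound by default, so values fall in 1..99
scores = pd.DataFrame(
    rng.integers(1, 100, size=(10, 3)),
    columns=["Math", "English", "Physics"],
)

# Boolean mask: keep rows where the Math score is strictly greater than 50
above_50 = scores.loc[scores["Math"] > 50]

# Equivalent selection written with query()
above_50_query = scores.query("Math > 50")

print(above_50)
```

Both selections keep the original row labels, which is why the index of the filtered output has gaps.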


2. Add a new column to the DataFrame based on the condition of another column and fill missing values in that column with the mean of the column.

**Solution 2**:

In [6]:
import pandas as pd
import numpy as np

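# Create a DataFrame with random integers
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 3)), columns=['Math', 'English', 'Physics'])

# Cast 'English' to float so it can hold NaN, then introduce a missing value
df['English'] = df['English'].astype(float)
df.loc[2, 'English'] = np.nan

# Add a new column 'Programming' based on a condition on column 'English'
# (NaN > 50 evaluates to False, so the row with the missing value is labeled 'Low')
df['Programming'] = np.where(df['English'] > 50, 'High', 'Low')

# Fill missing values in 'English' with the column mean
# (assign the result back rather than calling fillna(inplace=True) on a column selection)
df['English'] = df['English'].fillna(df['English'].mean())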

# Display the updated DataFrame
print(df)


   Math    English  Physics Programming
0    22  54.000000        8        High
1    89  51.000000       72        High
2    39  45.666667       17         Low
3    71  31.000000       64         Low
4    89  90.000000       95        High
5    34   1.000000       70         Low
6    80  30.000000       14         Low
7    51  72.000000       44        High
8     7  47.000000       17         Low
9    34  35.000000       46         Low
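
The order of the two steps in Solution 2 matters whenever the column mean falls above the threshold. The sketch below uses a small hand-made column (the values and the names `marks`, `label_first`, and `fill_first` are assumptions chosen for illustration) to compare labeling before imputation with labeling after it.

```python
import numpy as np
import pandas as pd

# Tiny illustrative column with one missing value
marks = pd.DataFrame({"English": [54.0, np.nan, 80.0, 31.0]})

# Labeling before imputation: NaN > 50 is False, so the missing row becomes 'Low'
label_first = np.where(marks["English"] > 50, "High", "Low")

# Imputing first: the mean of 54, 80 and 31 is 55, so the same row becomes 'High'
filled = marks["English"].fillna(marks["English"].mean())
fill_first = np.where(filled > 50, "High", "Low")

print(label_first.tolist())  # ['High', 'Low', 'High', 'Low']
print(fill_first.tolist())   # ['High', 'High', 'High', 'Low']
```

Which order is appropriate depends on whether the derived label should reflect the imputed value or treat missing data as failing the condition.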


3. Load a CSV file, group by a column, and calculate the mean of another column.

**Solution 3**:

In [9]:
import pandas as pd

# Load data from a CSV file
df = pd.read_csv('file.csv')

# Group by the 'Category' column and calculate the mean of 'Value' column
mean_values = df.groupby('Category')['Value'].mean()

# Display the mean values for each group
print(mean_values)
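
Solution 3 depends on a `file.csv` that is not included here. As a self-contained sketch, the example below reads hypothetical CSV text from an in-memory buffer instead; the rows and the `Category`/`Value` contents are invented purely for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for 'file.csv'
csv_text = """Category,Value
A,10
B,20
A,30
B,40
C,50
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
df = pd.read_csv(io.StringIO(csv_text))

# One mean per distinct Category: A -> 20.0, B -> 30.0, C -> 50.0
mean_values = df.groupby("Category")["Value"].mean()
print(mean_values)
```

The result is a Series indexed by `Category`; calling `.reset_index()` on it would turn it back into a two-column DataFrame.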


4. Split the Iris dataset into training and test sets (80% train, 20% test), and train a Support Vector Machine (SVM) classifier. Display the test accuracy.

**Solution 4**:

In [10]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split the dataset into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Train a Support Vector Machine classifier
svm_clf = SVC()
svm_clf.fit(X_train, y_train)

# Evaluate and display the accuracy on the test set
print(f"SVM Test Accuracy: {svm_clf.score(X_test, y_test):.2f}")


SVM Test Accuracy: 1.00
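
A single 80/20 split of Iris leaves only 30 test samples, so one accuracy figure can swing with the choice of `random_state`. The sketch below is one possible complement rather than part of the assignment: it assumes 5-fold cross-validation and adds a `StandardScaler` step in front of the SVM, both of which are choices made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Putting the scaler in a pipeline means it is re-fitted on each training fold only
model = make_pipeline(StandardScaler(), SVC())

# For a classifier, cv=5 uses stratified 5-fold splitting
scores = cross_val_score(model, X, y, cv=5)

print(f"Fold accuracies: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

Reporting the mean and spread across folds gives a steadier estimate than a single held-out score.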


5. Train a Random Forest classifier using the LFW (Labeled Faces in the Wild) dataset of face images with a train-test split. Display the test accuracy.

**Solution 5**:

In [11]:
from sklearn.datasets import fetch_lfw_people
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the LFW dataset (faces of people)
data = fetch_lfw_people(min_faces_per_person=70)

# Split the dataset into training and test sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=42)

# Train a Random Forest classifier on the training data
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

# Evaluate and display the accuracy on the test set
print(f"Random Forest Test Accuracy: {rf_clf.score(X_test, y_test):.2f}")


Random Forest Test Accuracy: 0.67
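
The LFW classes are imbalanced, so an accuracy figure is easier to interpret next to a majority-class baseline. The sketch below is a hedged variant, not the original solution: `evaluate_random_forest` and `lfw_accuracy` are hypothetical helper names, and the stratified split, fixed `random_state`, and `n_jobs=-1` are assumptions added for reproducibility and speed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def evaluate_random_forest(X, y, test_size=0.25, seed=42):
    """Return (test accuracy, majority-class baseline) for a stratified split."""
    # stratify=y keeps class proportions the same in both subsets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Fixed seed makes the forest reproducible; n_jobs=-1 uses all CPU cores
    clf = RandomForestClassifier(random_state=seed, n_jobs=-1)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)

    # Baseline: always predict the most frequent training class
    majority = np.bincount(y_train).argmax()
    baseline = float(np.mean(y_test == majority))
    return accuracy, baseline


def lfw_accuracy():
    """Run the evaluation on LFW (downloads the dataset on first use)."""
    from sklearn.datasets import fetch_lfw_people

    faces = fetch_lfw_people(min_faces_per_person=70)
    return evaluate_random_forest(faces.data, faces.target)
```

Calling `lfw_accuracy()` returns both numbers for the same dataset settings as Solution 5; the gap between them shows how much the forest improves on always guessing the most common person.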
