# Lab Exam - Set 1

This notebook contains implementations for all questions in Set 1.

## Question 1: NumPy Arrays - Creation, Reshaping, and Joining

**Concepts:**
- **NumPy**: Python library for numerical computing with arrays
- **Intrinsic objects**: Built-in NumPy functions like `arange()`, `zeros()`, `ones()`
- **Random functions**: Functions to generate random numbers
- **Reshaping**: Changing an array's shape without changing its data (the total number of elements must stay the same)
- **Joining**: Combining multiple arrays (concatenate, stack, etc.)
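
The reshaping and joining rules above can be sketched on small arrays. This is a minimal illustration, not part of the exam solution; the names `a`, `b`, and `c` are made up for the example. It shows that `-1` lets NumPy infer one dimension, that `concatenate` joins along an existing axis, and that `stack` adds a new axis.

```python
import numpy as np

a = np.arange(6)                 # 1D array with 6 elements: 0..5
b = a.reshape(2, -1)             # -1 lets NumPy infer the second dimension (3)
c = np.ones((2, 3), dtype=int)   # same shape as b, so the two can be joined

rows = np.concatenate([b, c], axis=0)  # join along existing axis 0 -> shape (4, 3)
cols = np.concatenate([b, c], axis=1)  # join along existing axis 1 -> shape (2, 6)
stacked = np.stack([b, c])             # add a new leading axis -> shape (2, 2, 3)

print(b.shape, rows.shape, cols.shape, stacked.shape)

# Reshaping to an incompatible size raises ValueError (6 elements cannot fill 4x2)
try:
    a.reshape(4, 2)
except ValueError as err:
    print("reshape failed:", err)
```

For 2D inputs like these, `np.vstack` and `np.hstack` behave like `concatenate` with `axis=0` and `axis=1` respectively.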

In [None]:
import numpy as np

# Creating arrays using intrinsic objects
arr1 = np.arange(12)  # Array from 0 to 11
arr2 = np.zeros((3, 4))  # 3x4 array of zeros
arr3 = np.ones((2, 6))  # 2x6 array of ones

# Creating arrays using random functions
arr4 = np.random.randint(0, 100, size=12)  # 12 random integers in [0, 100)
arr5 = np.random.rand(3, 4)  # 3x4 array of uniform random floats in [0, 1)

print("Array 1 (arange):", arr1)
print("\nArray 2 (zeros):\n", arr2)
print("\nArray 3 (ones):\n", arr3)
print("\nArray 4 (random integers):", arr4)
print("\nArray 5 (random floats):\n", arr5)

# Reshaping operations
reshaped1 = arr1.reshape(3, 4)  # Convert 1D to 3x4
reshaped2 = arr4.reshape(4, 3)  # Convert 1D to 4x3
print("\nReshaped arr1 to 3x4:\n", reshaped1)
print("\nReshaped arr4 to 4x3:\n", reshaped2)

# Joining operations
# Concatenate along axis 0 (rows)
concat_vertical = np.concatenate([reshaped1, arr2], axis=0)
print("\nVertical concatenation (6x4):\n", concat_vertical)

# Stack arrays horizontally
hstack_result = np.hstack([reshaped1, arr5])
print("\nHorizontal stack (3x8):\n", hstack_result)

# Vertical stack
vstack_result = np.vstack([reshaped1, arr2])
print("\nVertical stack (6x4):\n", vstack_result)

## Question 2: Pandas DataFrame - CSV Import and Data Cleaning

**Concepts:**
- **DataFrame**: 2D labeled data structure in Pandas (like an Excel table)
- **Missing values**: Empty or null entries in data (handled with `fillna()` or `dropna()`)
- **Outliers**: Data points that differ significantly from the rest (detected here using the IQR method)
- **Statistical summaries**: Count, mean, standard deviation, min, quartiles, and max via `describe()`
- **IQR (Interquartile Range)**: Q3 - Q1; values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers
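
As a worked example of the IQR rule, the sketch below computes the bounds for a small set of values. The numbers are illustrative only and are not the notebook's data. `np.percentile` is used here with its default linear interpolation, which is also the default of pandas' `quantile()`.

```python
import numpy as np

values = np.array([22, 25, 28, 29, 30, 35, 150])  # illustrative values only

q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range
lower = q1 - 1.5 * iqr                    # anything below this is flagged
upper = q3 + 1.5 * iqr                    # anything above this is flagged

outliers = values[(values < lower) | (values > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower}, {upper}]")
print("Outliers:", outliers)
```

Here Q1 is 26.5 and Q3 is 32.5, so the IQR is 6 and the bounds are 17.5 and 41.5; only 150 falls outside them.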

In [None]:
import pandas as pd
import numpy as np

# Create a sample CSV file for demonstration
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, np.nan, 28, 35, 22, 150, 29],  # 150 is an outlier; one value is missing
    'Salary': [50000, 60000, 55000, np.nan, 70000, 48000, 62000, 200000],  # 200000 is an outlier; one value is missing
    'Score': [85, 90, 78, 88, np.nan, 92, 87, 89]
}
df_sample = pd.DataFrame(data)
df_sample.to_csv('sample_data.csv', index=False)

# Import CSV file
df = pd.read_csv('sample_data.csv')
print("Original DataFrame:")
print(df)

# Handle missing values
print("\n\nMissing values count:")
print(df.isnull().sum())

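# Fill missing values: mean for Age and Score, median for Salary
# (the median is less affected by outliers such as 200000; note that the Age mean is inflated by the 150 outlier)
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated and may not update the DataFrame in newer pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
df['Score'] = df['Score'].fillna(df['Score'].mean())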

print("\n\nDataFrame after handling missing values:")
print(df)

# Detect outliers using IQR method
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

print("\n\nOutlier Detection:")
for col in ['Age', 'Salary', 'Score']:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    print(f"\n{col} - Bounds: [{lower:.2f}, {upper:.2f}]")
    if not outliers.empty:
        print(f"Outliers found:\n{outliers[['Name', col]]}")
    else:
        print("No outliers found")

# Statistical summaries
print("\n\nStatistical Summary:")
print(df.describe())

## Question 3: Data Visualization - Histograms and Density Plots

**Concepts:**
- **Histogram**: Bar chart showing how many values fall into each bin (frequency distribution)
- **Density plot (KDE)**: Smooth curve estimating the probability density of the data
- **Distribution patterns**: Shape of the data (normal, skewed, bimodal, etc.)
- **Matplotlib**: Python library for creating visualizations
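
The difference between a frequency histogram and a density-scaled one can be checked numerically. The sketch below uses synthetic data generated only for this illustration (the `sample` array and seed are assumptions, not the notebook's data). It shows that raw bin counts sum to the number of samples, while density-scaled bars enclose a total area of 1, which is what makes them comparable to a KDE curve on the same axes.

```python
import numpy as np

rng = np.random.default_rng(0)                     # seeded so the run is repeatable
sample = rng.normal(loc=50, scale=10, size=1000)   # synthetic data for illustration

counts, edges = np.histogram(sample, bins=10)             # raw frequencies per bin
density, _ = np.histogram(sample, bins=10, density=True)  # heights scaled to unit area

bin_widths = np.diff(edges)
area = (density * bin_widths).sum()

print("Total count:", counts.sum())
print("Area under density histogram:", area)
```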

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Use the DataFrame `df` from Question 2 (run that cell first)
# Plot histograms and density plots for numeric columns
# Note: pandas' kind='density' plots require SciPy to be installed
numeric_cols = ['Age', 'Salary', 'Score']

fig, axes = plt.subplots(3, 2, figsize=(12, 10))
fig.suptitle('Distribution Analysis - Histograms and Density Plots', fontsize=16)

for idx, col in enumerate(numeric_cols):
    # Histogram
    axes[idx, 0].hist(df[col], bins=10, color='skyblue', edgecolor='black', alpha=0.7)
    axes[idx, 0].set_title(f'{col} - Histogram')
    axes[idx, 0].set_xlabel(col)
    axes[idx, 0].set_ylabel('Frequency')
    axes[idx, 0].grid(True, alpha=0.3)
    
    # Density plot (KDE)
    df[col].plot(kind='density', ax=axes[idx, 1], color='coral', linewidth=2)
    axes[idx, 1].set_title(f'{col} - Density Plot')
    axes[idx, 1].set_xlabel(col)
    axes[idx, 1].set_ylabel('Density')
    axes[idx, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Additional: Combined histogram with density plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, col in enumerate(numeric_cols):
    axes[idx].hist(df[col], bins=10, density=True, alpha=0.6, color='skyblue',
                   edgecolor='black', label='Histogram')
    df[col].plot(kind='density', ax=axes[idx], color='red', linewidth=2, label='Density')
    axes[idx].set_title(f'{col} Distribution')
    axes[idx].set_xlabel(col)
    axes[idx].legend()  # Uses the explicit labels set above, independent of drawing order
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Question 4: K-Nearest Neighbors (KNN) Classifier on Iris Dataset

**Concepts:**
- **KNN**: Supervised learning algorithm that assigns a sample the majority class among its k nearest training points
- **Iris dataset**: Classic dataset of 150 samples covering 3 flower species and 4 features
- **Train-test split**: Dividing data into training and testing sets
- **Confusion matrix**: Table of actual versus predicted class counts; correct predictions lie on the diagonal
- **Accuracy**: Fraction of predictions that are correct
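
To make the KNN idea concrete, here is a minimal from-scratch sketch using only the standard library. The function `knn_predict` and the toy data are hypothetical and written for this illustration; they are not scikit-learn's implementation, which handles ties and neighbor search differently. The sketch uses Euclidean distance and a simple majority vote.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2D data, made up for illustration: two well-separated clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

Because each query point sits inside one cluster, its three nearest neighbors all share that cluster's label.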

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import matplotlib.pyplot as plt  # needed for the confusion matrix figure below
import seaborn as sns

# Load Iris dataset
iris = load_iris()
X = iris.data  # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Target: 0=setosa, 1=versicolor, 2=virginica

print("Dataset shape:", X.shape)
print("Classes:", iris.target_names)
print("\nFirst 5 samples:")
print(X[:5])

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

# Create and train KNN classifier (k=3)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\n\nAccuracy: {accuracy * 100:.2f}%")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names)
plt.title('Confusion Matrix - KNN Classifier')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Classification Report
print("\n\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
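
The metrics reported above can all be derived from a confusion matrix. The sketch below uses a hypothetical 3-class matrix whose counts are invented for illustration and are not the output of the cell above. It follows scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted);
# these counts are illustrative only
cm_example = np.array([
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
])

correct = np.diag(cm_example)                  # correctly classified count per class
accuracy = correct.sum() / cm_example.sum()    # correct predictions / all predictions
recall = correct / cm_example.sum(axis=1)      # per actual class (row totals)
precision = correct / cm_example.sum(axis=0)   # per predicted class (column totals)

print(f"Accuracy: {accuracy:.2f}")
print("Recall per class:", recall)
print("Precision per class:", precision)
```

In this example 27 of 30 predictions lie on the diagonal, so the accuracy is 0.90.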