# Lab Exam - Set 5

This notebook contains implementations for all questions in Set 5.

## Question 17: NumPy Arrays - Arithmetic and Statistical Operations with Universal Functions

**Concepts:**
- **NumPy Arrays**: Efficient multi-dimensional containers for numerical data
- **Universal Functions (ufuncs)**: Fast element-wise operations on arrays
- **Arithmetic operations**: Addition, subtraction, multiplication, division, power, modulo
- **Statistical operations**: Mean, median, standard deviation, variance, min, max, percentiles
- **Broadcasting**: NumPy's rules for applying element-wise operations to arrays of different but compatible shapes, without copying data
- **Aggregation functions**: Sum, product, cumulative sum, cumulative product

In [None]:
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Arithmetic operations
print(arr1 + arr2)
print(arr1 * arr2)

# Statistical operations
print(np.mean(arr1))
print(np.std(arr1))  # population standard deviation (ddof=0 by default)
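
The remaining concepts listed for this question (power, modulo, broadcasting, percentiles, cumulative aggregations) can be illustrated with a short sketch. The array values and variable names below are illustrative assumptions, not part of the exam data.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])  # illustrative values

# Element-wise ufuncs
powers = np.power(arr, 2)          # [1, 4, 9, 16]
remainders = np.mod(arr, 3)        # [1, 2, 0, 1]

# Broadcasting: a (2, 1) column combined with a (4,) row gives a (2, 4) result
column = np.array([[10], [20]])
broadcast_sum = column + arr       # [[11, 12, 13, 14], [21, 22, 23, 24]]

# Statistical operations
median = np.median(arr)            # 2.5
variance = np.var(arr)             # population variance: 1.25
p75 = np.percentile(arr, 75)       # 3.25 (linear interpolation)

# Aggregations
total = arr.sum()                  # 10
running_total = np.cumsum(arr)     # [1, 3, 6, 10]
running_product = np.cumprod(arr)  # [1, 2, 6, 24]

print(broadcast_sum)
print(median, variance, p75)
print(total, running_total, running_product)
```

Passing `ddof=1` to `np.std` or `np.var` gives the sample estimate instead of the population one.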

## Question 18: Pandas DataFrame - Import CSV, Drop Duplicates, and Group-wise Statistics

**Concepts:**
- **CSV Import**: Reading comma-separated values files using `pd.read_csv()`
- **Duplicate Records**: Rows with identical values in all or specific columns
- **drop_duplicates()**: Method that removes duplicate rows from a DataFrame; its `subset` parameter restricts the comparison to specific columns
- **GroupBy**: Splitting data into groups based on criteria
- **Aggregation**: Computing summary statistics for each group
- **Group-wise statistics**: Mean, sum, count, min, max for different categories

In [None]:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("data.csv")

# Drop duplicates
df = df.drop_duplicates()

# Group-wise mean
group_stats = df.groupby('category')['value'].mean()
print(group_stats)
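
Since the contents of `data.csv` are not shown, here is a self-contained sketch of the same workflow on a small in-memory CSV. The rows are hypothetical; only the `category` and `value` column names come from the cell above. It also computes several group-wise statistics in one call with `agg`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """category,value
A,10
A,10
A,20
B,5
B,15
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count rows that exactly repeat an earlier row, then remove them
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)

# Several statistics per group in a single call
group_stats = deduped.groupby("category")["value"].agg(
    ["mean", "sum", "count", "min", "max"]
)

print(f"Duplicate rows removed: {n_duplicates}")
print(group_stats)
```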


## Question 19: Overlapping Histograms - Compare Distributions Among Features

**Concepts:**
- **Histogram**: Graphical representation of data distribution using bins
- **Overlapping plots**: Multiple distributions drawn on the same axes for comparison
- **Distribution comparison**: Analyzing shape, center, and spread of different features
- **Alpha transparency**: Making overlapping plots visible using transparency
- **Matplotlib**: Python library for creating visualizations
- **Seaborn**: Statistical data visualization library built on matplotlib

In [None]:
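import matplotlib.pyplot as plt

# Draw both features on the same axes; alpha < 1 keeps overlapping bars visible
df[['feature1', 'feature2']].plot(kind='hist', alpha=0.5, bins=20)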
plt.title("Overlapping Histograms")
plt.show()
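
The same comparison can be sketched directly in Matplotlib with synthetic data, which makes the shared bin edges explicit. The distributions, sample size, and output filename below are assumptions chosen for illustration; only the `feature1` and `feature2` names come from the cell above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the two features
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature1": rng.normal(loc=0.0, scale=1.0, size=500),
    "feature2": rng.normal(loc=1.5, scale=0.7, size=500),
})

# Shared bin edges (20 bins) so bar widths are comparable across features
bins = np.linspace(df.min().min(), df.max().max(), 21)

fig, ax = plt.subplots()
for column in df.columns:
    ax.hist(df[column], bins=bins, alpha=0.5, label=column)

ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Overlapping Histograms")
ax.legend()
fig.savefig("overlapping_histograms.png")  # hypothetical output file
```

Using one set of bin edges for both features avoids the misleading impression that can arise when each histogram picks its own bin width.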


## Question 20: Logistic Regression for Binary Classification - Confusion Matrix and ROC Curve

**Concepts:**
- **Logistic Regression**: Supervised learning algorithm for binary classification
- **Binary Classification**: Predicting one of two possible outcomes (0 or 1, Yes or No)
- **Confusion Matrix**: Table showing True Positives, True Negatives, False Positives, False Negatives
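- **ROC Curve**: Receiver Operating Characteristic curve, plotting the true positive rate (TPR) against the false positive rate (FPR) across classification thresholds
- **AUC**: Area Under the ROC Curve; ranges from 0 to 1, where 0.5 corresponds to random guessing and 1.0 to a perfect classifier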
- **Accuracy**: (TP + TN) / Total predictions
- **Precision**: TP / (TP + FP)
- **Recall**: TP / (TP + FN)
- **F1-Score**: Harmonic mean of precision and recall
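
As a worked example of the formulas above, the following sketch computes each metric from confusion-matrix counts. The counts are hypothetical numbers chosen to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 40, 45, 5, 10

total = tp + tn + fp + fn                            # 100
accuracy = (tp + tn) / total                         # 0.85
precision = tp / (tp + fp)                           # 40 / 45
recall = tp / (tp + fn)                              # 40 / 50 = 0.8
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

The F1 expression is algebraically equal to `2 * tp / (2 * tp + fp + fn)`, which gives 80 / 95 here.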

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt

X = df[['x1', 'x2']]
y = df['target']

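# Hold out 30% of the rows for testing; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)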
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Confusion Matrix
print(confusion_matrix(y_test, y_pred))

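# ROC Curve: use the predicted probability of the positive class
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)  # area under the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()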
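
The cell above depends on a `df` with `x1`, `x2`, and `target` columns that this notebook does not define. A self-contained sketch of the same pipeline, using a synthetic dataset from `make_classification` as an assumed stand-in, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-feature binary dataset (assumption: stands in for x1, x2, target)
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Stratify so both splits keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"AUC = {roc_auc:.3f}")
```

`roc_auc_score(y_test, y_prob)` should return the same value as `auc(fpr, tpr)` and is a convenient shortcut when the curve itself is not needed.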
