# Lab Exam - Set 6

This notebook contains implementations for all questions in Set 6.

## Question 21: Pandas DataFrame - CSV Import with apply() and map() Transformations

In [1]:
import pandas as pd

# Load the dataset (expects 'price' and 'category' columns)
df = pd.read_csv("data.csv")

# apply(): add 5% tax to every price
df['price_with_tax'] = df['price'].apply(lambda x: x * 1.05)
# map(): upper-case every category label
df['category_upper'] = df['category'].map(lambda x: x.upper())

print(df.head())
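
Since `data.csv` is not included with the notebook, the following is a minimal self-contained sketch of the same transformations on a small in-memory DataFrame. The sample rows, the rounding, and the `category_code` lookup are illustrative assumptions; the other column names follow the cell above.

```python
import pandas as pd

# Hypothetical sample rows standing in for data.csv
sample = pd.DataFrame({
    "price": [100.0, 250.0, 40.0],
    "category": ["books", "toys", "food"],
})

# apply() on a Series calls the function once per element
sample["price_with_tax"] = sample["price"].apply(lambda x: round(x * 1.05, 2))

# map() is also element-wise; it accepts a function or a dict lookup
sample["category_upper"] = sample["category"].map(str.upper)
sample["category_code"] = sample["category"].map({"books": 1, "toys": 2, "food": 3})

print(sample)
```

For simple arithmetic such as the tax calculation, the vectorized form `sample["price"] * 1.05` gives the same values without a per-element function call.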

## Question 22: Data Preprocessing - Missing Data, Outliers, and Standardization

**Concepts:**
- **Missing data**: Empty or null entries (handled with `fillna()`, `dropna()`, imputation)
- **Outliers**: Data points significantly different from others (detected using IQR, Z-score)
- **IQR (Interquartile Range)**: `Q3 - Q1`; values below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR` are flagged as outliers
- **Standardization**: Scaling features to have mean 0 and standard deviation 1 using the formula `(x - mean) / std`
- **StandardScaler**: scikit-learn transformer that applies this standardization (important for algorithms sensitive to scale)
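
The first three concepts above can be sketched on a small hypothetical Series. The values are made up for illustration, and median imputation is used here as one possible choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one missing entry and one obvious outlier (95.0)
values = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 95.0, 12.0])

# Impute the missing entry with the median, which the outlier barely affects
filled = values.fillna(values.median())

# IQR fences: anything outside [lower, upper] is flagged
q1 = filled.quantile(0.25)
q3 = filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (filled < lower) | (filled > upper)
print(filled[is_outlier])
```

Imputing with the mean instead would pull the filled value toward the outlier, which is one reason to inspect outliers before choosing an imputation strategy.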

In [None]:
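from sklearn.preprocessing import StandardScaler

# Fill missing values in numeric columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Detect outliers in 'value' with the IQR rule
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = df[(df['value'] < lower) | (df['value'] > upper)]
print(f"Outliers found: {len(outliers)}")

# Standardization: rescale 'value' to mean 0 and standard deviation 1
scaler = StandardScaler()
df[['value']] = scaler.fit_transform(df[['value']])

print(df.head())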
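
To make the formula `(x - mean) / std` concrete, here is a small sketch that standardizes a hypothetical column by hand. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows; the pandas default for `.std()` is `ddof=1`, so the two differ slightly unless `ddof` is set.

```python
import pandas as pd

# Hypothetical column of values
value = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population statistics (ddof=0)
mean = value.mean()
std = value.std(ddof=0)

# z-score: distance from the mean in units of standard deviation
z = (value - mean) / std
print(z.tolist())
```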


## Question 23: Data Visualization - Line Plot and Scatter Plot with Regression Line

**Concepts:**
- **Line plot**: Graph showing trends over continuous data (useful for time series, sequential data)
- **Scatter plot**: Graph showing relationship between two variables
- **Regression line**: Best-fit line showing linear relationship between variables
- **Linear regression**: Statistical method to model relationship: y = mx + c
- **Correlation**: Measure of how strongly variables are related (-1 to +1)
- **Matplotlib/Seaborn**: Python libraries for data visualization
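
The regression line `y = mx + c` and the correlation coefficient can also be computed directly with NumPy, without plotting. This is a minimal sketch on hypothetical paired observations; the arrays `x` and `y` are invented for illustration.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of a degree-1 polynomial: returns slope m and intercept c
m, c = np.polyfit(x, y, 1)

# Pearson correlation coefficient, always between -1 and +1
r = np.corrcoef(x, y)[0, 1]

print(f"y = {m:.2f}x + {c:.2f}, r = {r:.3f}")
```

A value of `r` close to +1 means the points lie near an upward-sloping line, which is what the fitted line in a scatter plot shows visually.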

In [None]:
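import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot with a fitted regression line
sns.lmplot(x='feature1', y='feature2', data=df)
plt.title("Scatter Plot with Regression Line")
plt.show()

# Line plot of both features against the row index
df[['feature1', 'feature2']].plot(kind='line')
plt.title("Line Plot Showing Trend")
plt.show()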


## Question 24: K-Nearest Neighbors (KNN) Classifier - Compare Different k-values

**Concepts:**
- **KNN**: Supervised learning algorithm that classifies based on k nearest neighbors
- **k-value**: Number of nearest neighbors to consider (hyperparameter)
- **How KNN works**: Finds k closest points, uses majority vote for classification
- **Distance metric**: Usually Euclidean distance to find nearest neighbors
- **Choosing k**: Small k gives a more complex decision boundary (risk of overfitting); large k gives a smoother one (risk of underfitting)
- **Train-test split**: Divide data to train model and evaluate performance
- **Accuracy**: Percentage of correct predictions
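
The "find the k closest points, then take a majority vote" step can be sketched from scratch with NumPy. This is an illustrative toy, not how scikit-learn implements it; the function `knn_predict` and the data points are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))
print(knn_predict(X_train, y_train, np.array([6.0, 6.5]), k=3))
```

With two classes, an odd k avoids tied votes, which is one practical consideration when choosing the range of k-values to compare.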

In [None]:
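from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

X = df[['x1', 'x2']]
y = df['target']

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)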

accuracies = []
k_values = range(1, 11)

for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

plt.plot(k_values, accuracies, marker='o')
plt.title("KNN Accuracy vs K Values")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.show()
