# Data Science Learning Notebook
This notebook introduces the basics of NumPy, pandas, Matplotlib, and seaborn, and ends with a short scikit-learn example. We generate fake train and test data, manipulate it, visualize it, and fit a few simple classifiers to it.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set the default seaborn theme (sns.set is an older alias of sns.set_theme)
sns.set_theme(style="whitegrid")

## Numpy Basics
NumPy is a library for numerical computing in Python. It provides n-dimensional arrays (including 2D matrices) and a large set of mathematical functions that operate on whole arrays at once.

In [None]:
# Create a numpy array
array = np.array([1, 2, 3, 4, 5])
print("Numpy Array:", array)

# Create a 2D numpy array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Numpy Matrix:\n", matrix)

# Perform basic operations
print("Sum of array:", np.sum(array))
print("Mean of matrix:", np.mean(matrix))
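
The reductions above collapse the whole array to a single number. As a small sketch of two related ideas, the example below reduces along one axis at a time and uses broadcasting to subtract the column means from every row. The variable names are illustrative and not part of the original cell.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=0 collapses the rows, giving one value per column
column_sums = matrix.sum(axis=0)   # [12 15 18]

# axis=1 collapses the columns, giving one value per row
row_means = matrix.mean(axis=1)    # [2. 5. 8.]

# Broadcasting: the 1D array of column means is subtracted from every row
centered = matrix - matrix.mean(axis=0)

print("Column sums:", column_sums)
print("Row means:", row_means)
print("Column-centered matrix:\n", centered)
```

After centering, every column of `centered` has a mean of zero.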

## Pandas Basics
pandas is a library for data manipulation and analysis. Its two core data structures are the Series (a labeled one-dimensional array) and the DataFrame (a labeled two-dimensional table).

In [None]:
# Create a pandas Series
series = pd.Series([1, 2, 3, 4, 5])
print("Pandas Series:\n", series)

# Create a pandas DataFrame
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print("Pandas DataFrame:\n", df)

# Perform basic operations
print("Sum of column A:", df['A'].sum())
print("Column means:\n", df.mean())  # df.mean() returns one mean per column
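
Beyond whole-column operations, it is common to pull out specific rows and columns. The sketch below shows label-based selection with `loc`, position-based selection with `iloc`, and a row-wise mean, using the same small DataFrame. It is one way to illustrate indexing, not the only one.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})

# Label-based selection: index labels 1 through 2 (inclusive), columns A and C
subset = df.loc[1:2, ['A', 'C']]

# Position-based selection: the first row, returned as a Series
first_row = df.iloc[0]

# Row-wise mean (axis=1), in contrast to the column-wise default of df.mean()
row_means = df.mean(axis=1)

print("Subset:\n", subset)
print("First row:\n", first_row)
print("Row means:\n", row_means)
```

Note that `loc` slices include the end label, while `iloc` slices follow the usual Python convention and exclude the end position.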

## Generating Fake Train and Test Data
We will generate fake train and test data using numpy and pandas.

In [None]:
# Generate fake data
np.random.seed(0)  # For reproducibility
train_data = np.random.randn(100, 4)  # 100 rows, 4 columns
test_data = np.random.randn(20, 4)  # 20 rows, 4 columns

# Create pandas DataFrames
train_df = pd.DataFrame(train_data, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'])
test_df = pd.DataFrame(test_data, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'])

print("Train DataFrame:\n", train_df.head())
print("Test DataFrame:\n", test_df.head())
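
As an alternative to seeding the global random state, NumPy also offers a local `Generator` object created with `np.random.default_rng`. The sketch below builds frames of the same shape that way. This is a swapped-in technique, so the numbers it draws differ from those produced by `np.random.seed(0)` and `np.random.randn` above.

```python
import numpy as np
import pandas as pd

# Assumption: a local Generator replaces the global np.random.seed call
rng = np.random.default_rng(0)
feature_names = [f"Feature{i}" for i in range(1, 5)]

# standard_normal takes the shape as a single tuple
train_df = pd.DataFrame(rng.standard_normal((100, 4)), columns=feature_names)
test_df = pd.DataFrame(rng.standard_normal((20, 4)), columns=feature_names)

print("Shapes:", train_df.shape, test_df.shape)
print("Train mean and std per feature:\n", train_df.describe().loc[['mean', 'std']])
```

Because the generator is an ordinary object, it can be passed to functions explicitly, which keeps randomness reproducible without relying on global state.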

## Data Manipulation
We will perform some basic data manipulation using pandas.

In [None]:
# Add a new column to train DataFrame
train_df['Target'] = np.random.randint(0, 2, size=100)  # Binary target variable
print("Train DataFrame with Target:\n", train_df.head())

# Filter rows where Target is 1
filtered_df = train_df[train_df['Target'] == 1]
print("Filtered DataFrame (Target=1):\n", filtered_df.head())
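
Filtering on one condition is often followed by counting or summarizing per class. The sketch below rebuilds a small frame so it runs on its own (its random draws are not identical to the cells above), then counts the classes, averages each feature per class, and combines two conditions in one filter.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
columns = ['Feature1', 'Feature2', 'Feature3', 'Feature4']
train_df = pd.DataFrame(np.random.randn(100, 4), columns=columns)
train_df['Target'] = np.random.randint(0, 2, size=100)

# How many rows fall in each class
class_counts = train_df['Target'].value_counts()

# Mean of every feature, computed separately for Target=0 and Target=1
class_means = train_df.groupby('Target').mean()

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
positive_and_high = train_df[(train_df['Target'] == 1) & (train_df['Feature1'] > 0)]

print("Class counts:\n", class_counts)
print("Per-class feature means:\n", class_means)
print("Rows with Target=1 and Feature1>0:", len(positive_and_high))
```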

In [None]:
# Add new features based on existing ones
train_df['Feature_Sum'] = train_df['Feature1'] + train_df['Feature2']
train_df['Feature_Diff'] = train_df['Feature3'] - train_df['Feature4']
train_df['Feature_Product'] = train_df['Feature1'] * train_df['Feature3']
train_df['Feature_Ratio'] = train_df['Feature2'] / (train_df['Feature4'] + 1e-5)  # Small offset guards against an exact-zero denominator; near-zero denominators still give very large ratios

print("Train DataFrame with Intermediate Features:\n", train_df.head())
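
The small offset in the ratio handles a denominator of exactly zero, but it does not help when the denominator happens to cancel the offset. The sketch below uses a tiny hand-made frame to show that case and one possible alternative: masking near-zero denominators so the ratio becomes NaN. The `tolerance` value is a hypothetical threshold chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Feature2': [1.0, 2.0, 3.0, 4.0],
    # Includes an exact zero and a value (-1e-5) that cancels the 1e-5 offset
    'Feature4': [2.0, 0.0, -1e-5, -4.0],
})

# Offset approach: works for 0.0, but -1e-5 + 1e-5 is 0, so that row becomes infinite
offset_ratio = df['Feature2'] / (df['Feature4'] + 1e-5)

# Alternative: replace near-zero denominators with NaN before dividing
tolerance = 1e-8  # hypothetical threshold
safe_denominator = df['Feature4'].where(df['Feature4'].abs() > tolerance)
masked_ratio = df['Feature2'] / safe_denominator

print("Offset ratio:\n", offset_ratio)
print("Masked ratio:\n", masked_ratio)
```

With the mask, the undefined ratio is an explicit NaN that can be dropped or imputed later, rather than an infinity that may break downstream models.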

## Data Visualization with Matplotlib and Seaborn
We will visualize the data using matplotlib and seaborn.

In [None]:
# Plot a histogram of Feature1
plt.figure(figsize=(8, 6))
plt.hist(train_df['Feature1'], bins=20, color='blue', alpha=0.7)
plt.title('Histogram of Feature1')
plt.xlabel('Feature1')
plt.ylabel('Frequency')
plt.show()

# Plot a seaborn heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = train_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

# Plot a pairplot of the train DataFrame
sns.pairplot(train_df, hue='Target')
plt.show()
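
To compare several features at a glance, the single histogram can be extended to a grid. The sketch below draws one histogram per feature with `plt.subplots`. It regenerates the feature columns so it runs on its own, and the layout choices (2x2 grid, shared x-axis) are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
feature_names = ['Feature1', 'Feature2', 'Feature3', 'Feature4']
train_df = pd.DataFrame(np.random.randn(100, 4), columns=feature_names)

# One axes object per feature, arranged in a 2x2 grid with a shared x-axis
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)

for ax, name in zip(axes.ravel(), feature_names):
    ax.hist(train_df[name], bins=20, color='blue', alpha=0.7)
    ax.set_title(name)
    ax.set_ylabel('Frequency')

fig.tight_layout()
plt.show()
```

Sharing the x-axis makes it easier to see that all four features are drawn from the same distribution.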

## Using Sklearn for Model Training and Evaluation
We will use sklearn to train a simple model and evaluate its performance.

In [None]:
# Import necessary libraries from sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Split the data into train and validation sets
X = train_df[['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature_Sum', 'Feature_Diff', 'Feature_Product', 'Feature_Ratio']]
y = train_df['Target']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
conf_matrix = confusion_matrix(y_val, y_pred)
class_report = classification_report(y_val, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
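
Logistic regression can be sensitive to features on very different scales, and a ratio feature like Feature_Ratio can take values far larger than the others. One common option, sketched below under the assumption that scaling is wanted here, is to standardize the features inside a pipeline. The data is regenerated so the example runs on its own, and `max_iter=1000` is an illustrative setting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
columns = ['Feature1', 'Feature2', 'Feature3', 'Feature4']
X = pd.DataFrame(np.random.randn(100, 4), columns=columns)
X['Feature_Ratio'] = X['Feature2'] / (X['Feature4'] + 1e-5)
y = np.random.randint(0, 2, size=100)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# The scaler is fit on the training rows only, then applied to the validation rows
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

val_accuracy = scaled_model.score(X_val, y_val)
print("Validation accuracy with scaling:", val_accuracy)
```

Putting the scaler inside the pipeline keeps validation statistics out of the fitting step.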

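Different models can produce different accuracies. Scoring each model on the validation rows held out by train_test_split, rather than on the rows it was trained on, gives an honest estimate of how it performs on unseen data. Because Target here is random and unrelated to the features, accuracies near 0.5 are expected.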
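
To see why a held-out set matters, the sketch below fits an unconstrained decision tree to random labels and scores it on both the training rows and the validation rows. The data is generated inside the example, so the exact validation number will not match other cells in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = rng.integers(0, 2, size=100)  # labels are independent of X

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# With no depth limit, the tree can memorize every training row
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, tree.predict(X_train))
val_accuracy = accuracy_score(y_val, tree.predict(X_val))

print("Training accuracy:", train_accuracy)
print("Validation accuracy:", val_accuracy)
```

The training accuracy is perfect because the tree memorizes the rows it has seen, even though there is nothing to learn. Only the validation score reflects that.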

In [34]:
# Import necessary libraries from sklearn
from sklearn.tree import DecisionTreeClassifier

# Train a Decision Tree Classifier model
dt_model = DecisionTreeClassifier(random_state=0)
dt_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_dt = dt_model.predict(X_val)

# Evaluate the Decision Tree model
accuracy_dt = accuracy_score(y_val, y_pred_dt)
conf_matrix_dt = confusion_matrix(y_val, y_pred_dt)
class_report_dt = classification_report(y_val, y_pred_dt)

print("Decision Tree Classifier Accuracy:", accuracy_dt)
print("Decision Tree Classifier Confusion Matrix:\n", conf_matrix_dt)
print("Decision Tree Classifier Classification Report:\n", class_report_dt)

Decision Tree Classifier Accuracy: 0.5
Decision Tree Classifier Confusion Matrix:
 [[6 5]
 [5 4]]
Decision Tree Classifier Classification Report:
               precision    recall  f1-score   support

           0       0.55      0.55      0.55        11
           1       0.44      0.44      0.44         9

    accuracy                           0.50        20
   macro avg       0.49      0.49      0.49        20
weighted avg       0.50      0.50      0.50        20



In [None]:
# Import necessary libraries from sklearn
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest Classifier model
rf_model = RandomForestClassifier(random_state=0)
rf_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_rf = rf_model.predict(X_val)

# Evaluate the Random Forest model
accuracy_rf = accuracy_score(y_val, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_val, y_pred_rf)
class_report_rf = classification_report(y_val, y_pred_rf)

print("Random Forest Classifier Accuracy:", accuracy_rf)
print("Random Forest Classifier Confusion Matrix:\n", conf_matrix_rf)
print("Random Forest Classifier Classification Report:\n", class_report_rf)
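
With only 20 validation rows, a single split gives a noisy accuracy estimate. One possible extension, sketched below, is to compare the three models with 5-fold cross-validation and collect the results in one table. The data is regenerated so the example runs on its own, and the `summary` table layout is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = rng.integers(0, 2, size=100)  # random binary target, as in this notebook

models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    # Each model is trained and scored five times on different train/validation folds
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    rows.append({'model': name, 'mean_accuracy': scores.mean(), 'std_accuracy': scores.std()})

summary = pd.DataFrame(rows).set_index('model')
print(summary)
```

The standard deviation column shows how much the accuracy moves between folds, which helps judge whether a difference between two models is meaningful.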