<a href="https://colab.research.google.com/github/SandhyaGiribabu/Machine_Learning_Lab/blob/main/Experiments/Experiment_1/Experiment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 **Experiment 1: Working with Python packages – NumPy, SciPy, Scikit-learn, Matplotlib**

**1. Explore the various functions and methods available in the following Python libraries: NumPy,
Pandas, SciPy, Scikit-learn, Matplotlib. Understand the key operations such as array
manipulations, data preprocessing, mathematical computing, machine learning workflows,
and data visualization.**

The following is a brief overview of the key functions, methods, and operations available in the Python libraries: NumPy, Pandas, SciPy, Scikit-learn, and Matplotlib. These libraries are essential for performing tasks related to data analysis, machine learning, numerical computing, and visualization.

**1. NumPy (Numerical Python)
NumPy provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them.**

Key Operations and Methods:

Array Creation: array(), zeros(), ones(), arange(), linspace()

Array Manipulation: reshape(), flatten(), transpose(), concatenate(), stack()

Mathematical Computations: sum(), mean(), std(), dot(), functions from numpy.linalg for linear algebra

Indexing and Slicing: Support for advanced slicing, masking, and broadcasting
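
The NumPy operations listed above can be tried on a small array. The following is a minimal sketch; the variable names are chosen only for illustration.

```python
import numpy as np

# Array creation
a = np.arange(6)                    # integers 0..5
grid = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced values from 0 to 1

# Array manipulation
m = a.reshape(2, 3)                 # 2x3 matrix: [[0, 1, 2], [3, 4, 5]]
mt = m.T                            # transpose, shape (3, 2)
stacked = np.concatenate([m, m], axis=0)  # shape (4, 3)

# Mathematical computations
total = m.sum()                     # sum of all elements
col_means = m.mean(axis=0)          # mean of each column
gram = m.dot(mt)                    # matrix product, shape (2, 2)

# Boolean masking and broadcasting
evens = a[a % 2 == 0]               # keep only the even entries
centered = m - col_means            # (2, 3) minus (3,) broadcasts row-wise

print(total, col_means, evens)
print(gram)
```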

**2. Pandas (Python Data Analysis Library)
Pandas offers data structures and operations for manipulating numerical tables and time series data, primarily through its Series and DataFrame objects.**

Key Operations and Methods:

Data Structures: Series (1D), DataFrame (2D)

Data Loading: read_csv(), read_excel(), read_json()

Data Inspection: head(), info(), describe(), shape

Data Selection: loc[], iloc[], filtering using boolean indexing

Data Cleaning: dropna(), fillna(), replace(), astype()

Aggregation: groupby(), agg(), pivot_table()

Merging and Joining: merge(), concat(), join()
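
The Pandas operations listed above can be sketched on a tiny in-memory table. The `city`, `sales`, and `region` columns and their values are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "sales": [10.0, np.nan, 30.0, 50.0],
})

# Data inspection
print(df.head())
print(df.shape)

# Data selection: label-based with a boolean mask, and position-based
high = df.loc[df["sales"] > 20, ["city", "sales"]]
first_row = df.iloc[0]

# Data cleaning: fill the missing value with the column mean
clean = df.fillna({"sales": df["sales"].mean()})

# Aggregation: mean sales per city
by_city = clean.groupby("city")["sales"].mean()

# Merging: attach a region to each city
regions = pd.DataFrame({"city": ["A", "B"], "region": ["North", "South"]})
merged = clean.merge(regions, on="city", how="left")
print(merged)
```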

**3. SciPy (Scientific Python)
SciPy is built on top of NumPy and provides modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, and other scientific computations.**

Key Modules and Functions:

Linear Algebra: scipy.linalg – advanced operations like matrix inversion, solving systems of equations

Optimization: scipy.optimize – for finding minima/maxima and solving equations

Statistics: scipy.stats – probability distributions, statistical tests

Signal Processing: scipy.signal

Numerical Integration: scipy.integrate – for definite integrals and differential equations
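
A short sketch of four of the SciPy modules listed above follows. The equations and functions are small examples chosen here so that the exact answers are easy to verify by hand.

```python
import numpy as np
from scipy import integrate, linalg, optimize, stats

# Linear algebra: solve 3x + y = 9 and x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)          # exact answer is x = 2, y = 3

# Optimization: minimize (t - 2)^2, whose minimum is at t = 2
res = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)

# Numerical integration: integral of t^2 from 0 to 3, exactly 9
area, abs_err = integrate.quad(lambda t: t ** 2, 0.0, 3.0)

# Statistics: standard normal CDF at 0, exactly 0.5
p = stats.norm.cdf(0.0)

print(solution, res.x, area, p)
```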

**4. Scikit-learn (Machine Learning Library)
Scikit-learn is a robust library for machine learning, offering simple and efficient tools for data mining and data analysis.**

Key Operations and Components:

Preprocessing and Data Splitting: StandardScaler, LabelEncoder, OneHotEncoder (from sklearn.preprocessing); train_test_split() (from sklearn.model_selection)

Models and Algorithms: LinearRegression, LogisticRegression, SVC, DecisionTreeClassifier, KMeans, etc.

Model Fitting: .fit() method

Prediction: .predict() method

Evaluation: accuracy_score(), confusion_matrix(), classification_report() (from sklearn.metrics); cross_val_score() (from sklearn.model_selection)

Pipelines: Used to streamline workflows by chaining preprocessing and modeling steps
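
The components above fit together into a typical workflow: split the data, chain preprocessing and a model in a pipeline, fit, predict, and evaluate. The sketch below uses the Iris dataset bundled with scikit-learn; the split ratio, random seed, and choice of LogisticRegression are illustrative assumptions, not a recommended configuration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pipeline: scaling is fitted on the training data only, then the classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Test accuracy:", acc)
```

Putting the scaler inside the pipeline means its mean and variance are learned from the training split alone, which avoids leaking information from the test split.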

**5. Matplotlib (Plotting and Visualization Library)
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.**

Key Functions and Methods:

Basic Plotting: plot(), scatter(), bar(), hist(), pie()

Axis Labels and Titles: xlabel(), ylabel(), title(), legend(), grid()

Subplots: subplot(), subplots() – for arranging multiple plots in one figure

Saving Figures: savefig() – to export plots to image files

Style Customization: Support for themes and plot aesthetics using style.use()
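
The plotting functions above can be combined as in the sketch below. It uses the object-oriented interface, where the Axes methods `set_xlabel()`, `set_ylabel()`, and `set_title()` correspond to the pyplot functions `xlabel()`, `ylabel()`, and `title()`. The output file name and the `Agg` backend line are assumptions made so the example also runs outside a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Two plots side by side in one figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(x, np.sin(x), label="sin(x)")
ax1.set_xlabel("x")
ax1.set_ylabel("value")
ax1.set_title("Line plot")
ax1.legend()
ax1.grid(True)

ax2.hist(np.cos(x), bins=10)
ax2.set_title("Histogram of cos(x)")

fig.tight_layout()

# Save the figure to an image file (hypothetical file name)
out_path = os.path.join(tempfile.gettempdir(), "example_plot.png")
fig.savefig(out_path)
plt.close(fig)
```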



**2. Explore public repositories such as the UCI Machine Learning Repository (UCI Repository) and Kaggle Datasets. Download the following datasets and identify the appropriate machine learning model to be used (e.g., Supervised, Unsupervised, Semi-supervised, Regression, Classification) [CO1, K3].**

**i.) Loan amount prediction**

In [None]:
!pip install pandas scikit-learn seaborn opendatasets --quiet


In [None]:
import opendatasets as od

# Automatically downloads and unzips to local directory
od.download("https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset")

import pandas as pd

# Load training data
loan_df = pd.read_csv("loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv")
print(loan_df.head())


Skipping, found downloaded files in "./loan-prediction-problem-dataset" (use force=True to force download)
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0    

Each input record has a corresponding label, so this is a supervised learning problem. The value to predict, LoanAmount, is a continuous number, so it is a regression problem. (Predicting the Loan_Status column instead would be a classification problem.)
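
To make the regression framing concrete, the sketch below fits a linear model with LoanAmount as the target. It uses only the five rows visible in the `head()` output above, typed in by hand, so it illustrates the workflow rather than producing a meaningful model; the choice of the two income columns as features is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Five rows copied from the head() output; far too few for a real model
loan_sample = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 0.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],
})

# Rows with a missing target cannot be used for supervised training
train = loan_sample.dropna(subset=["LoanAmount"])

X = train[["ApplicantIncome", "CoapplicantIncome"]]
y = train["LoanAmount"]          # continuous target, hence regression

reg = LinearRegression().fit(X, y)
predicted = reg.predict(X)
print(predicted)
```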

**ii.) Handwritten character recognition**

In [None]:
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print("MNIST training shape:", x_train.shape)
print(x_train)

# Flatten images and convert to float
flat_x = x_train.reshape(x_train.shape[0], -1).astype(float)  # shape: (60000, 784)
mnist_df = pd.DataFrame(flat_x)
mnist_df['label'] = y_train

# Sample 1000 rows for faster plotting
mnist_df_sample = mnist_df.sample(n=1000, random_state=42)


MNIST training shape: (60000, 28, 28)
[[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 ...

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]]


Each image has a corresponding label, so this is a supervised learning problem. The labels are discrete classes (the digits 0 to 9 in MNIST), so it is a multi-class classification problem.
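
A multi-class classifier for image data can be sketched as follows. To avoid downloading MNIST, this example swaps in scikit-learn's bundled 8x8 digits dataset (`load_digits`) as a small stand-in; the SVC model with default settings and the split parameters are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

# Flatten each 8x8 image into a 64-element feature vector
X = digits.images.reshape(len(digits.images), -1)
y = digits.target
n_classes = len(set(y))          # one class per digit

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = SVC()                      # default RBF kernel
clf.fit(X_train, y_train)

digit_acc = accuracy_score(y_test, clf.predict(X_test))
print("Classes:", n_classes, "Test accuracy:", digit_acc)
```

The same flatten, split, fit, and evaluate steps would apply to the 28x28 MNIST images, with 784 features per image instead of 64.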

**iii.) Classification of Email spam**


In [None]:
import pandas as pd

spam_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
spam_cols = [f"feature_{i}" for i in range(57)] + ["label"]
spam_df = pd.read_csv(spam_url, header=None, names=spam_cols)

print("Spam data shape:", spam_df.shape)
print(spam_df)

Spam data shape: (4601, 58)
      feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  \
0          0.00       0.64       0.64        0.0       0.32       0.00   
1          0.21       0.28       0.50        0.0       0.14       0.28   
2          0.06       0.00       0.71        0.0       1.23       0.19   
3          0.00       0.00       0.00        0.0       0.63       0.00   
4          0.00       0.00       0.00        0.0       0.63       0.00   
...         ...        ...        ...        ...        ...        ...   
4596       0.31       0.00       0.62        0.0       0.00       0.31   
4597       0.00       0.00       0.00        0.0       0.00       0.00   
4598       0.30       0.00       0.30        0.0       0.00       0.00   
4599       0.96       0.00       0.00        0.0       0.32       0.00   
4600       0.00       0.00       0.65        0.0       0.00       0.00   

      feature_6  feature_7  feature_8  feature_9  ...  feature_48  feature_49  \
0 

Each email has a corresponding label, so this is a supervised learning problem. The label takes one of two values (0 = not spam, 1 = spam), so it is a binary classification problem.
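
A binary classifier and its confusion matrix can be sketched as below. So that the example runs without network access, it uses synthetic data from `make_classification` with 57 features to mimic the shape of the Spambase table; the data, the decision tree, and its depth are all assumptions for illustration and say nothing about real spam.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 500 samples, 57 features, two classes (0 and 1)
X, y = make_classification(
    n_samples=500, n_features=57, n_informative=10, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, tree.predict(X_test))
print(cm)
```

With the real dataset, `spam_df.drop(columns="label")` and `spam_df["label"]` would take the place of the synthetic `X` and `y`.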

**iv.) Predicting Diabetes**

In [None]:
from sklearn.datasets import load_diabetes
import pandas as pd

diabetes = load_diabetes()
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target

print(diabetes_df.head())


        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  target  
0 -0.002592  0.019907 -0.017646   151.0  
1 -0.039493 -0.068332 -0.092204    75.0  
2 -0.002592  0.002861 -0.025930   141.0  
3  0.034309  0.022688 -0.009362   206.0  
4 -0.002592 -0.031988 -0.046641   135.0  


Each record has a corresponding target value, so this is a supervised learning problem. The target is a continuous numeric value (in this scikit-learn dataset, a quantitative measure of disease progression one year after baseline), so it is a regression problem.
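
The regression framing can be sketched with a linear model on the same bundled dataset. The 80/20 split, the random seed, and the choice of LinearRegression are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lin = LinearRegression().fit(X_train, y_train)
y_pred = lin.predict(X_test)

# Regression is scored with error-based metrics rather than accuracy
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse, "R^2:", r2)
```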

**v.) Iris Dataset**

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

print(iris_df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  


Each sample has a corresponding label, so this is a supervised learning problem. The labels are classes (the species of the flower, encoded as 0, 1, and 2), so it is a multi-class classification problem.
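
A quick way to check a classifier on a small dataset such as Iris is cross-validation. The sketch below assumes a decision tree and 5 folds; both choices are for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: each fold is used once as the test set
scores = cross_val_score(tree, X, y, cv=5)
mean_acc = scores.mean()
print("Fold accuracies:", scores)
print("Mean accuracy:", mean_acc)
```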