### <b><u> MACHINE LEARNING </u></b>

Machine learning is a subset of artificial intelligence concerned with getting computers to learn from data rather than follow hand-written rules. Machine learning algorithms build a model from sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.

#### Types of Machine Learning:

1. **Supervised Learning**: In this type of learning, the model is trained on a labeled dataset, which means that each training example is paired with an output label. The model learns to map inputs to the correct output based on the provided labels. Common algorithms include linear regression, logistic regression, and support vector machines.

2. **Unsupervised Learning**: In unsupervised learning, the model is given data without explicit instructions on what to do with it. The goal is to find hidden patterns or intrinsic structures in the input data. Common algorithms include clustering (like K-means) and dimensionality reduction (like PCA).

3. **Reinforcement Learning**: This type of learning involves an agent that learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward. The agent receives feedback in the form of rewards or penalties based on its actions. Common algorithms include Q-learning and deep reinforcement learning.

Beyond these three, two further types are commonly distinguished: semi-supervised learning and self-supervised learning.

4. **Semi-supervised Learning**: This type of learning falls between supervised and unsupervised learning. It uses a small amount of labeled data along with a large amount of unlabeled data to improve learning accuracy. This approach is particularly useful when labeling data is expensive or time-consuming.

5. **Self-supervised Learning**: In self-supervised learning, the model generates its own labels from the input data. It learns to predict part of the data from other parts, effectively creating a supervised learning problem from an unsupervised dataset. This approach is commonly used in natural language processing and computer vision.
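
The difference between the first two types can be seen on a small example. The sketch below is only an illustration: the toy data, the variable names, and the choice of `LogisticRegression` and `KMeans` are assumptions made for this example, not part of the dataset used later in this notebook. The supervised model is given the labels, while the clustering model receives only the inputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X_toy = np.vstack([group_a, group_b])
y_toy = np.array([0] * 50 + [1] * 50)  # labels, used only by the supervised model

# Supervised: the model sees both the inputs and the labels.
clf = LogisticRegression().fit(X_toy, y_toy)
supervised_accuracy = clf.score(X_toy, y_toy)

# Unsupervised: the model sees only the inputs and groups them on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = kmeans.labels_

print(supervised_accuracy)
print(np.unique(cluster_ids))
```

Note that the cluster ids are arbitrary: K-means may call the first group 0 or 1, because it never sees the labels.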

#### Applications of Machine Learning:

1. **Image Recognition**: Machine learning algorithms can be used to recognize and classify objects in images, such as faces, objects, and scenes.

2. **Speech Recognition**: Machine learning algorithms can be used to convert spoken audio into text. The reverse task, converting text into speech, is known as speech synthesis.

3. **Natural Language Processing**: Machine learning algorithms can be used to analyze and understand natural language, such as sentiment analysis and machine translation.

4. **Recommendation Systems**: Machine learning algorithms can be used to recommend products or services to users based on their preferences and behavior.

5. **Anomaly Detection**: Machine learning algorithms can be used to detect anomalies or outliers in data, such as fraudulent transactions or anomalies in sensor data.

6. **Fraud Detection**: Machine learning algorithms can be used to detect fraudulent activities, such as credit card fraud or identity theft.

7. **Malware Detection**: Machine learning algorithms can be used to detect and prevent malware infections.

8. **Customer Segmentation**: Machine learning algorithms can be used to segment customers into different groups based on their behavior, demographics, or preferences.

9. **Predictive Maintenance**: Machine learning algorithms can be used to predict when equipment or machinery is likely to fail, allowing for proactive maintenance and reducing downtime.

#### Popular Machine Learning Libraries and Frameworks:

1. **TensorFlow**: An open-source deep learning framework developed by Google.

2. **PyTorch**: An open-source deep learning framework originally developed at Meta AI.

3. **Scikit-learn**: An open-source library for classical machine learning, covering supervised and unsupervised algorithms, preprocessing, and model evaluation.

4. **Keras**: A high-level API for building and training neural networks, most often used with TensorFlow.

5. **XGBoost**: An open-source library for gradient-boosted decision trees.

6. **LightGBM**: An open-source gradient boosting library from Microsoft, designed for speed on large datasets.

7. **CatBoost**: An open-source gradient boosting library from Yandex with built-in handling of categorical features.

8. **Statsmodels**: An open-source library for statistical modeling and hypothesis testing. It is a statistics library rather than a machine learning library.

9. **OpenCV**: An open-source library for computer vision and image processing.

10. **Matplotlib**: An open-source plotting library. It is not a machine learning library, but it is widely used to visualize data and results.

11. **Seaborn**: An open-source statistical visualization library built on top of Matplotlib.

12. **NLTK**: An open-source toolkit for natural language processing, widely used in teaching and research.

13. **SpaCy**: An open-source natural language processing library designed for production use.

14. **Hugging Face Transformers**: An open-source library of pretrained transformer models for natural language processing and other tasks.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer 

In [3]:
cancer_data=load_breast_cancer(as_frame=True)

In [11]:
X = cancer_data.data       # `data` is an attribute (not a function) holding the features as a DataFrame, so head, tail, describe, etc. are available
X.head(1)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |


In [12]:
y = cancer_data.target    # `target` is an attribute holding the labels as a Series
y.head(1)

0    0
Name: target, dtype: int64

### Supervised Machine Learning Overview

- **Supervised Learning** involves a model learning from existing data and the corresponding labels.
- The model predicts labels based on input features.

#### Types of Labels in Supervised Learning:

1. **Continuous Labels**:
   - Example: Predicting the price of cars based on features like model, mileage, etc.
   - The output is a continuous value.
   
2. **Categorical Labels**:
   - Example: In scikit-learn's breast cancer dataset, the labels are **malignant (0)** or **benign (1)**.
   - The output is a category or class.

#### Classification Task:
- When the label or target to predict is a **categorical value**, it's called a **classification task**.
- The model is known as a **classifier**, and it tries to categorize the input into specific classes.
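
Before training a classifier, it helps to confirm which integer stands for which class and how many samples each class has. The following is a minimal sketch using the `target_names` attribute of the object returned by `load_breast_cancer`; the variable names `cancer`, `label_names`, and `class_counts` are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer(as_frame=True)

# target_names[i] is the class name for integer label i.
label_names = dict(enumerate(cancer.target_names))
print(label_names)

# Count how many samples fall into each class.
class_counts = cancer.target.value_counts().sort_index()
print(class_counts)
```

If the classes turn out to be imbalanced, plain accuracy can be a misleading metric, so this check is worth doing early.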


In [None]:
y.unique()

array([0, 1])

In [13]:
X.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [14]:
# X.info()

### Preparing Training and Test Datasets

#### Key Concepts:
- **Training a model** on data involves using labeled data to teach the model how to make predictions.
- **Making predictions** involves using the trained model to predict labels for unseen data.

##### Terminology:
1. **Training Data** (or **Training Set**): 
   - The data used to train the model.
   
2. **Test Data** (or **Test Set**): 
   - The unseen data that the model has not trained on. 
   - It's used to evaluate the model's performance by predicting labels for this data.

#### Splitting the Dataset:

- Since we don't always have a separate test set, we can split our original dataset into a **training set** and a **test set**.
- We'll use `scikit-learn`'s `train_test_split` function for this.

#### Steps:

1. **Inputs**:
   - The function takes **features** (input data) and **labels** (target data) as inputs.
   - Conventionally, features are denoted by **X** and labels by **y**.
   
2. **Proportion**:
   - We decide the proportion for the test set (usually 15-20%).
   - This can vary based on the size of the dataset.

3. **Random Splitting**:
   - The dataset is randomly split into training and test sets.
   
#### Output of `train_test_split()`:
- **Training Set Features** (X_train)
- **Test Set Features** (X_test)
- **Training Set Labels** (y_train)
- **Test Set Labels** (y_test)
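
The order of the four return values is easy to mix up, so here is a small sketch on hypothetical toy data (`X_demo` and `y_demo` are made up for this example). It also passes `stratify`, an optional argument not used elsewhere in this notebook, which asks for similar class proportions in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, balanced binary labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Note the return order: both feature sets first, then both label sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of rows go to the test set
    random_state=0,     # fixes the shuffle so the split is reproducible
    stratify=y_demo,    # optional: keep class proportions similar in both sets
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Fixing `random_state` matters when comparing models: without it, each run produces a different split, and accuracy differences may come from the split rather than from the model.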

#### Data Preparation:
- Additional steps like **data cleaning** are often needed before training a model. We’ll learn more about this in future lessons.


In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [18]:
X_train.shape

(455, 30)

In [19]:
X_train.shape[0]/X.shape[0]*100  # to check the split ratio

79.96485061511423

In [21]:
X_test.shape[0]/X.shape[0]*100  # to check the split ratio

20.035149384885763

### Building and Training a Classifier

In [22]:
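from sklearn.svm import LinearSVC
# penalty='l2' and loss='squared_hinge' are the library defaults; C=10 applies
# weaker regularization than the default C=1.0. random_state fixes the seed.
model = LinearSVC(penalty='l2', loss='squared_hinge', C=10, random_state=417)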

In [23]:
model.fit(X_train,y_train)

In [24]:
X_test.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 204 | 12.47 | 18.6 | 81.09 | 481.9 | 0.09965 | 0.1058 | 0.08005 | 0.03821 | 0.1925 | 0.06373 | ... | 14.97 | 24.64 | 96.05 | 677.9 | 0.1426 | 0.2378 | 0.2671 | 0.1015 | 0.3014 | 0.0875 |
| 70 | 18.94 | 21.31 | 123.6 | 1130.0 | 0.09009 | 0.1029 | 0.108 | 0.07951 | 0.1582 | 0.05461 | ... | 24.86 | 26.58 | 165.9 | 1866.0 | 0.1193 | 0.2336 | 0.2687 | 0.1789 | 0.2551 | 0.06589 |
| 131 | 15.46 | 19.48 | 101.7 | 748.9 | 0.1092 | 0.1223 | 0.1466 | 0.08087 | 0.1931 | 0.05796 | ... | 19.26 | 26.0 | 124.9 | 1156.0 | 0.1546 | 0.2394 | 0.3791 | 0.1514 | 0.2837 | 0.08019 |
| 431 | 12.4 | 17.68 | 81.47 | 467.8 | 0.1054 | 0.1316 | 0.07741 | 0.02799 | 0.1811 | 0.07102 | ... | 12.88 | 22.91 | 89.61 | 515.8 | 0.145 | 0.2629 | 0.2403 | 0.0737 | 0.2556 | 0.09359 |
| 540 | 11.54 | 14.44 | 74.65 | 402.9 | 0.09984 | 0.112 | 0.06737 | 0.02594 | 0.1818 | 0.06782 | ... | 12.26 | 19.68 | 78.78 | 457.8 | 0.1345 | 0.2118 | 0.1797 | 0.06918 | 0.2329 | 0.08134 |


In [25]:
y_test.head()

204    1
70     0
131    0
431    1
540    1
Name: target, dtype: int64

In [29]:
y_predicted=model.predict(X_test)

In [30]:
model.predict(X_test.iloc[[1]])

array([0])

In [31]:
y_test==y_predicted

204     True
70      True
131     True
431     True
540     True
       ...  
486     True
75      True
249     True
238    False
265     True
Name: target, Length: 114, dtype: bool

In [32]:
(y_test==y_predicted).sum()/y_test.shape[0]*100  # accuracy calculation

np.float64(95.6140350877193)
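
The manual calculation above is the fraction of predictions that match the true labels. scikit-learn provides `accuracy_score` for the same quantity. The sketch below uses small hypothetical arrays (`y_true`, `y_pred`) in place of `y_test` and `y_predicted` so that it runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, standing in for y_test and y_predicted.
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 1, 0, 0, 1])

manual_accuracy = (y_true == y_pred).mean()        # fraction of matching entries
library_accuracy = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn

print(manual_accuracy, library_accuracy)
```

Both values are fractions between 0 and 1; multiply by 100 to get a percentage as in the cell above.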

In [33]:
import pickle
with open('newmodel.pkl','wb') as f:
    pickle.dump(model,f)

In [34]:
with open('newmodel.pkl','rb') as f:
    loaded_model=pickle.load(f)   # restore the saved model from disk

In [35]:
loaded_model.predict(X_test.iloc[[1]])

array([0])
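
A quick way to gain confidence in a saved model is to check that the restored copy predicts exactly what the original does. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on a tiny hypothetical dataset (`X_small`, `y_small`); it is an illustration, not a replacement for the file-based cells above. Keep in mind that unpickling can execute arbitrary code, so only load pickle files from sources you trust.

```python
import pickle

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical toy data, only to have a fitted model to serialize.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0, 0, 1, 1])
original = LinearSVC(random_state=0).fit(X_small, y_small)

# Serialize to bytes and restore: the in-memory equivalent of dump/load on a file.
restored = pickle.loads(pickle.dumps(original))

same_predictions = np.array_equal(
    original.predict(X_small), restored.predict(X_small)
)
print(same_predictions)
```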

## Model Improvement and Key Considerations

### Increased Accuracy:
- We improved our model’s accuracy by tweaking a few parameters.
- Our model is now better at predicting whether a tumor is malignant or benign.

#### Achievements:
- Trained a model using **scikit-learn**.
- Experimented with a few parameters.
- Achieved reasonably high accuracy.

#### However, We Did This Without:
- Understanding what the model actually does.
- Knowing what each parameter controls.
- Understanding most of the dataset’s features.

### The Bigger Picture:

#### Parameter Sensitivity:
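- Results can hinge on a single setting. `max_iter` caps the number of solver iterations in `LinearSVC` (the cell above leaves it at its default); with a cap of **3500** versus **3000**, for example, the solver may or may not converge, and accuracy can differ noticeably.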
- Imagine experimenting with 12 parameters in `scikit-learn`'s **LinearSVC**.
    - How many permutations and combinations could we try?
    - What if we lowered **C** instead of increasing it? What does **C** control in the model?
    - Would lowering it improve performance on the test set?
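
One way to approach the questions about **C** is to compare a few values with cross-validation on the training set instead of guessing. In `LinearSVC`, `C` is the inverse of the regularization strength, so a smaller `C` means stronger regularization. The sketch below is an illustration under stated assumptions: the candidate values, `max_iter=10000`, and the `StandardScaler` step are choices made for this example and are not part of the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

mean_cv_accuracy = {}
for C in [0.01, 0.1, 1, 10]:
    # Assumption: features are standardized first, which the original cells do not do.
    candidate = make_pipeline(
        StandardScaler(),
        LinearSVC(C=C, max_iter=10000, random_state=417),
    )
    # 5-fold cross-validation on the training set only; the test set stays untouched.
    scores = cross_val_score(candidate, X_tr, y_tr, cv=5)
    mean_cv_accuracy[C] = scores.mean()

for C, acc in mean_cv_accuracy.items():
    print(f"C={C}: mean CV accuracy = {acc:.3f}")
```

Comparing candidates on cross-validation folds rather than on the test set keeps the test set as an honest final check.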

#### Data and Model Considerations:
- **New Data**: If we get more data with new features, will the selected parameters still work? Or will we need to retrain and experiment again?
- **Large Datasets**: If we had millions of observations, could we afford to keep running random models with random parameters?
    - Which model would be suitable for such large datasets?

#### Dataset Issues:
- What if the dataset itself had issues?
    - What if some features were more relevant than others?
    - Could identifying these features have helped more than just trying different models?
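
A first, rough look at feature relevance is possible without training any model. The sketch below ranks features by the absolute Pearson correlation with the target. This is only one simple heuristic, chosen here for illustration: it captures linear, one-feature-at-a-time relationships and ignores interactions between features.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer(as_frame=True)
features, target = cancer.data, cancer.target

# Absolute Pearson correlation between each feature and the 0/1 target,
# sorted from most to least correlated.
relevance = features.corrwith(target).abs().sort_values(ascending=False)

print(relevance.head(5))
```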

### Key Takeaway:
- **Experimentation** and **iteration** through direct application have their place in machine learning.
- However, **blind experimentation** without understanding the fundamentals won't take us far.
- Understanding how a machine learning algorithm works under the hood opens up additional insights.
- This allows us to experiment and iterate from an informed perspective.

### Next Steps:
- In the upcoming lessons, we'll take that informed perspective.
- We'll explore a different machine learning algorithm.
- We'll implement it from scratch and then use **scikit-learn** again.
