<a href="https://colab.research.google.com/github/H3nr7M/machine_learning_101/blob/main/Machine_learning_101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning
Machine learning is a rapidly growing field that has transformed the way we approach problems in a variety of domains, including finance, healthcare, and technology. At its core, machine learning involves building algorithms that can automatically learn patterns in data and make predictions or decisions based on those patterns.

If you're new to machine learning, it can be overwhelming to know where to start. This repository provides a basic introduction to machine learning. We are going to focus on supervised learning, which is the most common type of machine learning problem. We will also cover some of the most important concepts in machine learning, and we will build a model, going through the whole process from scratch.

So let's get started!

## Introduction

Machine learning (ML) is a type of artificial intelligence (AI), as shown in the image below, that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Classical machine learning is often categorized by how an algorithm learns to become more accurate in its predictions. There are four basic approaches: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.



<img src='machine.png'>

## What is an approach?

An approach is a way of doing something. In machine learning, an approach is a way of training a model, as shown in the image below. These types of approaches will be discussed in the following sections.

<img src='types.png'>

Within supervised and unsupervised learning, there are many algorithms that can be used to train a model. Each algorithm has its own characteristics and is better suited to certain types of problems, such as classification, regression, and clustering. The image below gives a brief overview of the algorithms and the uses of each one.

<img src='supervised.png'>

### Supervised learning

In this type of machine learning, data scientists supply algorithms with labeled training data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified. Supervised learning can solve several types of tasks, such as classification and regression, each with its own algorithms.


<img src='bannana.png'>
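
As a minimal sketch of what "labeled training data" means in practice, the snippet below fits a classifier on inputs paired with known outputs. The iris dataset bundled with scikit-learn and the `KNeighborsClassifier` model are assumptions chosen only for illustration; they are not the model used later in this project.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: X holds the inputs (features), y holds the known outputs (labels)
X, y = load_iris(return_X_y=True)

# The algorithm learns a mapping from inputs to outputs
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict the label of the first sample and compare it with its known label
predicted = model.predict(X[:1])
print(predicted[0], y[0])
```

Because both the inputs and the expected outputs are given to `fit`, the model can be checked against the known answers, which is what distinguishes supervised learning from the other approaches.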

Before starting our first project, we have to understand some basic concepts.

## Processing data

In the machine learning training process, data processing is a critical step that involves preparing the data for use in the model training. Here are some common tasks involved in data processing and their importance:

- Manage missing values: It is common for datasets to have missing values. These missing values can cause errors in the model training process. Therefore, it is important to handle these missing values before proceeding with the model training. There are several techniques to manage missing values, including dropping the rows with missing values or imputing missing values using techniques such as mean or median imputation.

- Label encoding: Many machine learning algorithms require the data to be in numeric form. Therefore, categorical variables need to be converted into numeric form. Label encoding is one of the techniques used to perform this conversion. In label encoding, each unique value in a categorical variable is assigned a unique integer value.

- Handle imbalanced datasets: Imbalanced datasets occur when the number of instances in one class is significantly larger or smaller than the other classes. Handling imbalanced datasets is critical to ensure that the model is not biased towards the majority class. Techniques such as oversampling, undersampling, and generating synthetic samples can be used to handle imbalanced datasets.

- Standardization of the data: Standardization is the process of scaling the data to have zero mean and unit variance. Standardization is important to ensure that the features are on a similar scale and to improve the performance of some machine learning algorithms.

- Split our dataset: It is important to split the data into training and test datasets. The training dataset is used to train the model, while the test dataset is used to evaluate the performance of the model. It is important to ensure that the model is not overfitting to the training data and performs well on unseen data.


As you can see, data processing is an essential step in the machine learning training process that involves preparing the data for use in the model training. The tasks involved in data processing, such as managing missing values, label encoding, handling imbalanced datasets, standardization, and data splitting, ensure that the model is trained on high-quality data and performs well on unseen data.

### Manage missing values
It is common for datasets to have missing values. These missing values can cause errors in the model training process. Therefore, it is important to handle these missing values before proceeding with the model training. There are several techniques to manage missing values, including dropping the rows with missing values or imputing missing values using techniques such as mean or median imputation.

In this project we are going to use sklearn.datasets, along with pandas and numpy, to manage our data.


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn.datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [3]:
# loading the dataset in our folder to a Pandas DataFrame
dataset = pd.read_csv('Placement_Dataset.csv')

In [4]:
# the following command counts the null values in each column; here the salary column has null values
dataset.isnull().sum()

sl_no              0
gender             0
ssc_p              0
ssc_b              0
hsc_p              0
hsc_b              0
hsc_s              0
degree_p           0
degree_t           0
workex             0
etest_p            0
specialisation     0
mba_p              0
status             0
salary            67
dtype: int64

The cells below show four ways to manage missing values. Choose the one that best fits your project, and do not apply all of them to the same data.


In [8]:
# This command replaces the null values with the median of the salary column
# (assigning the result back avoids the unreliable chained inplace=True pattern)
dataset['salary'] = dataset['salary'].fillna(dataset['salary'].median())

In [None]:
# This command replaces the null values with the mean of the salary column
dataset['salary'] = dataset['salary'].fillna(dataset['salary'].mean())


In [None]:
# This command replaces the null values with the mode of the salary column
# mode() returns a Series (there can be ties), so take its first value with [0]
dataset['salary'] = dataset['salary'].fillna(dataset['salary'].mode()[0])

In [6]:
# This command drops every row that contains at least one null value
dataset = dataset.dropna(how='any')

After managing the null values, we can run `dataset.isnull().sum()` again and see that there are no null values left in our dataset.

In [7]:
dataset.isnull().sum()

sl_no             0
gender            0
ssc_p             0
ssc_b             0
hsc_p             0
hsc_b             0
hsc_s             0
degree_p          0
degree_t          0
workex            0
etest_p           0
specialisation    0
mba_p             0
status            0
salary            0
dtype: int64
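
To see how the four strategies differ, here is a small sketch on a toy column. The values are made up for illustration and are not taken from `Placement_Dataset.csv`; the single large value is there to show how it pulls the mean but not the median.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'salary'; the values are invented for this example
toy = pd.DataFrame({"salary": [200.0, 300.0, np.nan, 300.0, 1000.0, np.nan]})

# Each strategy is computed from the non-null values only
median_filled = toy["salary"].fillna(toy["salary"].median())
mean_filled = toy["salary"].fillna(toy["salary"].mean())
mode_filled = toy["salary"].fillna(toy["salary"].mode()[0])

# Dropping removes the incomplete rows instead of filling them
dropped = toy.dropna(subset=["salary"])

print(median_filled.tolist())
print(mean_filled.tolist())
print(mode_filled.tolist())
print(len(toy), len(dropped))
```

Here the median and mode both fill the gaps with 300.0, while the mean fills them with 450.0 because of the 1000.0 entry. Dropping keeps only the four complete rows.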

## Label Encoding

Many machine learning algorithms require the data to be in numeric form. Therefore, categorical variables need to be converted into numeric form. Label encoding is one of the techniques used to perform this conversion. In label encoding, each unique value in a categorical variable is assigned a unique integer value.

In [8]:
# loading the data from csv file to pandas dataFrame
cancer_data = pd.read_csv('data.csv')

In [9]:
# finding the count of different labels
cancer_data['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

As you can see, the `diagnosis` column is a categorical variable with two values, B for benign and M for malignant. We are going to use the `LabelEncoder` from sklearn to convert this categorical variable into numeric form.

In [10]:
# load the Label Encoder function
label_encode = LabelEncoder()
labels = label_encode.fit_transform(cancer_data.diagnosis)
# appending the labels to the DataFrame
cancer_data['target'] = labels

After converting our categorical variable into numeric form, we can see that the new `target` column contains only the values 0 and 1.

In [11]:
cancer_data['target'].value_counts()

0    357
1    212
Name: target, dtype: int64
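
It is often useful to know which integer was assigned to which label. The sketch below uses a short made-up list of labels rather than the cancer dataset, and relies on the encoder's `classes_` attribute and `inverse_transform` method to recover the mapping.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["M", "B", "B", "M", "B"])

# classes_ lists the original labels in sorted order;
# the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the labels are sorted before codes are assigned, B becomes 0 and M becomes 1, which matches the counts shown above (357 benign, 212 malignant).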

## Handle imbalanced datasets

Imbalanced datasets occur when the number of instances in one class is significantly larger or smaller than the other classes. Handling imbalanced datasets is critical to ensure that the model is not biased towards the majority class. Techniques such as oversampling, undersampling, and generating synthetic samples can be used to handle imbalanced datasets.

In [12]:
# Undersample the majority class so both labels have the same count (the original proportion was 357 benign and 212 malignant)

benign = cancer_data[cancer_data.target == 0]
malignant = cancer_data[cancer_data.target == 1]

# randomly keep as many benign rows as there are malignant rows
benign_sample = benign.sample(n=len(malignant))

new_dataset = pd.concat([benign_sample, malignant], axis=0)

new_dataset['target'].value_counts()

0    212
1    212
Name: target, dtype: int64

Now we have a balanced dataset.
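
The cell above uses undersampling, which discards rows from the majority class. Oversampling, also mentioned earlier, goes the other way and repeats minority rows. The following is a minimal sketch using `sklearn.utils.resample` on a made-up toy DataFrame; the column names and values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data with a 6:2 class imbalance; the values are invented
toy = pd.DataFrame({"feature": range(8), "target": [0, 0, 0, 0, 0, 0, 1, 1]})

majority = toy[toy.target == 0]
minority = toy[toy.target == 1]

# Draw minority rows with replacement until they match the majority count
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

balanced = pd.concat([majority, minority_upsampled], axis=0)
counts = balanced["target"].value_counts()
print(counts)
```

Oversampling keeps all the original information but duplicates rows, so it should be applied only to the training data to avoid copies of the same row appearing in both the training and test sets.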

## Standardization of the data

Standardization is the process of scaling the data to have zero mean and unit variance. Standardization is important to ensure that the features are on a similar scale and to improve the performance of some machine learning algorithms.

In [13]:
# loading the dataset
dataset = sklearn.datasets.load_breast_cancer()

In [14]:
# See the features of the dataset
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

In [16]:
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


As you can see, the variables in our dataset are on very different scales, so we are going to use the `StandardScaler` from sklearn to standardize our data so that each feature has zero mean and unit variance.

In [17]:
# Splitting labels and features
X = df 
Y = dataset.target

In [18]:
# Splitting the dataset into training (80%) and testing (20%) data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
print(X.shape, X_train.shape, X_test.shape)

(569, 30) (455, 30) (114, 30)


In [19]:
# This command shows the standard deviation of the whole dataset before standardization
print(dataset.data.std())

228.29740508276657


In [22]:
# Standardizing the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)

In [23]:
# Show the standard deviation of the training and test data after standardization
print(X_train_standardized.std())
print(X_test_standardized.std())

1.0
0.8654541077212674
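
Under the hood, standardization applies $z = (x - \mu) / \sigma$ to each feature, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature in the training data. The sketch below reproduces this by hand with NumPy on a tiny made-up array and compares it with `StandardScaler`. The array values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales; the values are invented
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Per-feature statistics are computed from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std  # the test data reuses the training statistics

scaler = StandardScaler().fit(X_train)
print(np.allclose(manual_train, scaler.transform(X_train)))
print(np.allclose(manual_test, scaler.transform(X_test)))
```

Because the test data is scaled with the training mean and standard deviation, its own standard deviation is generally not exactly 1. This is why the output above shows 1.0 for the training data and about 0.865 for the test data.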


## Split our dataset

It is important to split the data into training and test datasets. The training dataset is used to train the model, while the test dataset is used to evaluate the performance of the model. It is important to ensure that the model is not overfitting to the training data and performs well on unseen data.

In [24]:
# loading the data from csv file to pandas dataFrame
iris_data = pd.read_csv('iris_data.csv')

In [25]:
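# loading the label encoder, this you know by now :)
label_encoder_1 = LabelEncoder()
iris_labels = label_encoder_1.fit_transform(iris_data.Species)
# appending the labels to the DataFrame (the next cell reads the 'target' column)
iris_data['target'] = iris_labels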

In [27]:
# separating the data and labels
iris_data1 = iris_data.drop(columns='Species')
X = iris_data1.drop(columns='target')
Y = iris_data['target']

In [28]:
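# Standardizing the data
# (note: fitting the scaler on all rows before splitting lets test-set statistics
# influence the scaling; fitting on X_train only, as in the previous section, avoids this)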
scaler = StandardScaler()
scaler.fit(X)
standardized_data = scaler.transform(X)
X = standardized_data

In [29]:
# Splitting the dataset into training (80%) and testing (20%) data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [30]:
# This command will show the sizes of the training and testing data
print(X.shape, X_train.shape, X_test.shape)

(150, 5) (120, 5) (30, 5)
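
When the classes are not equally represented, a purely random split can leave the test set with a different class proportion than the training set. A possible refinement, sketched here on made-up data, is the `stratify` parameter of `train_test_split`, which keeps the label proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 15 of class 0 and 5 of class 1 (invented values)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y preserves the 75% / 25% class proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

With 20 samples and `test_size=0.2`, the test set holds 4 samples, of which 3 belong to class 0 and 1 to class 1, mirroring the proportions of the full dataset.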


In conclusion, this repository provides a basic introduction to machine learning, covering the fundamental concepts and techniques that underpin this field. We explored key concepts such as supervised and unsupervised learning.

We also delved into the practical side of machine learning, including managing missing values, label encoding, handling imbalanced datasets, standardization, and data splitting. We covered the importance of these techniques in preparing the data for model training, and we demonstrated how to implement them using popular Python libraries such as pandas and scikit-learn.

Throughout this repository, we worked on a practical example to illustrate the application of these techniques. We used a real-world dataset and implemented these techniques to prepare the data for use in a machine learning model. This practical example provided a hands-on experience and helped to solidify the concepts covered in the repository.

You can use this knowledge to start building your own machine learning models and apply them to real-world problems.