# Google Colab for Intro to Machine Learning 



*  Colab is a free notebook environment that runs entirely in the cloud. It lets you and your team members edit notebooks together.
*  Essentially, it is a hosted Jupyter notebook that runs code on Google servers instead of your local PC.
*  It is a web-based interactive environment that combines live code, visualizations, equations, and text.
*  You can also access files stored in your Google Drive by mounting it.
*  Most importantly, it **supports** many popular **machine learning libraries**, which can be easily loaded into your Colab notebook.




**Common Python libraries used for ML**

*   SciPy
*   Scikit-learn
*   Theano
*   TensorFlow
*   Keras
*   **PyTorch**
*   Pandas



Importing Some Python Libraries that are Required

In [None]:
import torch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

Downloading the MNIST dataset with torchvision

In [None]:
training_data = datasets.MNIST(root="data",train=True, download=True, transform=ToTensor())

test_data = datasets.MNIST(root="data", train=False, download=True, transform=ToTensor())

Finding the sizes of the training and test sets

In [None]:
print(training_data.targets.shape)
print(test_data.targets.shape)

Plotting 10 random examples of each digit

In [None]:
num_row = 10
num_col = 10
fig, axes = plt.subplots(num_row, num_col, figsize=(1.5*num_col,2*num_row))

for i in range(10):
  idx = (training_data.targets==i)
  temp_data = training_data.data[idx]
  temp_target = training_data.targets[idx]

  for j in range(10):
    idx1 = np.random.randint(0,len(temp_target))
    ax = axes[i, j]
    ax.imshow(temp_data[idx1], cmap='gray')
    ax.set_title('Label: {}'.format(temp_target[idx1]))
plt.tight_layout()
plt.show()

## Introduction
#### **MOUNT GOOGLE DRIVE**

With just two lines of code, we can mount Google Drive to load and save files.


PS: A prompt will ask you to authorize access.


In [None]:
from google.colab import drive
drive.mount('/content/drive')
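
Because `google.colab` is importable only inside Colab, a notebook that is also run locally can guard the mount. The following is a minimal sketch; the helper name `mount_drive_if_colab` is hypothetical and not part of any library. Outside Colab it simply returns `False`.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True

mounted = mount_drive_if_colab()
print("Drive mounted:", mounted)
```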

**BASH**

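*   Bash is a command-line interpreter (shell), one way of interacting with the operating system (OS).
*   It lets the user communicate with the kernel (the program at the core of a computer's OS).
*   Within Google Colab, shell commands are run by prefixing them with **!**. Some commands, such as `cd`, are also available as IPython magics prefixed with **%**. A **!** command runs in a temporary subshell, so `!cd` does not change the notebook's working directory, whereas `%cd` does.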



Some Useful BASH commands:

## **`echo`**

**`echo` is a command that outputs the strings that are passed to it as arguments. `echo` can be thought of as a print function.**

In [None]:
!echo hello world
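
The `!` prefix works only inside a notebook. In a plain Python script, the same command could be run with the standard-library `subprocess` module. This is a small sketch that assumes a Unix-like system where `echo` is available as an executable:

```python
import subprocess

# Run `echo hello world` as a child process and capture its standard output.
result = subprocess.run(
    ["echo", "hello world"], capture_output=True, text=True, check=True
)
message = result.stdout.strip()
print(message)
```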

## **`man`**

**`man`, short for "manual", is an interface to the system reference manuals, which describe commands and their options.**

In [None]:
!man echo

## **`pwd`**

**`pwd` or "print working directory", returns the current working directory. The default working directory for Google Colab is the `/content` folder.**

In [None]:
!pwd

## **`cd`**

**`cd` or "change directory", is used to change the working directory. The target path is given after `cd`.**

In [None]:
%cd /content/drive/MyDrive/ML_Demo_4123/Datasets

## **`ls`**

**`ls` or "list", prints out the content within the current working directory.**

In [None]:
!ls
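
The `pwd`, `cd`, and `ls` commands also have standard-library counterparts in Python, which can be handy inside functions or scripts. The sketch below uses a temporary scratch folder (an assumption made so that it runs anywhere) rather than a real Drive path:

```python
import os
import tempfile
from pathlib import Path

start_dir = os.getcwd()                      # like `pwd`

with tempfile.TemporaryDirectory() as tmp:   # scratch folder standing in for a Drive path
    Path(tmp, "example.csv").write_text("a,b\n1,2\n")
    os.chdir(tmp)                            # like `%cd <path>`
    listing = sorted(os.listdir("."))        # like `ls`
    print(listing)
    os.chdir(start_dir)                      # go back before the folder is deleted

restored = os.getcwd() == start_dir
```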

In [None]:
file_out=pd.read_csv('/content/drive/MyDrive/ML_Demo_5722/Cric_data.csv')
print(file_out.head())

In [None]:
class Featuredataset(Dataset):
    def __init__(self,filename):
    #read csv file and load row data into variable
      file_out=pd.read_csv(filename)
      x=file_out.iloc[0:,  1:6].values
      y=file_out.iloc[0:,  7].values
      self.X_data=torch.tensor(x, dtype=torch.float32)
      self.Y_data=torch.tensor(y)
      print(self.X_data)

  

    def __len__(self):
      return len(self.Y_data)
    
    def __getitem__(self,idx):
      # Index the full dataset so that __getitem__ agrees with __len__
      return self.X_data[idx],self.Y_data[idx]

    def plotterdata(self):
      #_______________________________Bowling Average VS Batting Average___________________________________
      plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})
      plt.scatter(self.X_data[self.Y_data==0,2],self.X_data[self.Y_data==0,4],color = 'blue',marker= '*', label='All rounder')
      plt.scatter(self.X_data[self.Y_data==-1,2],self.X_data[self.Y_data==-1,4],color = 'hotpink',marker= 'o', label='Batsman')
      plt.scatter(self.X_data[self.Y_data==1,2],self.X_data[self.Y_data==1,4],color = 'red',marker= 'x', label='Bowler')
      plt.title('Data visualization of Cricket data')
      plt.xlabel('Batting Average')
      plt.ylabel('Bowling Average')
      plt.legend(loc='best')
      plt.show()
      #_______________________________Wickets taken VS Runs scored___________________________________
      plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})
      plt.scatter(self.X_data[self.Y_data==0,1],self.X_data[self.Y_data==0,3],color = 'blue',marker= '*', label='All rounder')
      plt.scatter(self.X_data[self.Y_data==-1,1],self.X_data[self.Y_data==-1,3],color = 'hotpink',marker= 'o', label='Batsman')
      plt.scatter(self.X_data[self.Y_data==1,1],self.X_data[self.Y_data==1,3],color = 'red',marker= 'x', label='Bowler')
      plt.title('Data visualization of Cricket data')
      plt.xlabel('Runs Scored')
      plt.ylabel('Wickets Taken')
      plt.legend(loc='best')
      plt.show()
    
    def train_test_split(self):
      #Split in 80:20 ratio
      self.X_train, self.X_test, self.y_train,self.y_test = train_test_split(self.X_data, self.Y_data,test_size=0.2)
      print(len(self.y_train))
      print(len(self.y_test))


    def trainset(self):
      return self.X_train,self.y_train
    
    def testset(self):
      return self.X_test,self.y_test


In [None]:
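from torch.utils.data import TensorDataset
featureset = Featuredataset('/content/drive/MyDrive/ML_Demo_5722/Cric_data.csv')
featureset.train_test_split()
# trainset is a method returning (X_train, y_train); wrap the pair so DataLoader can index it
X_train, y_train = featureset.trainset()
trainloader = DataLoader(TensorDataset(torch.as_tensor(X_train), torch.as_tensor(y_train)), batch_size=10, shuffle=False)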
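
A map-style `Dataset` only needs `__len__` and `__getitem__`; `DataLoader` then takes care of batching. The sketch below uses a hypothetical `ToyDataset` with made-up tensors in place of the CSV file, so that the batch shapes can be inspected without any data on disk:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in for a CSV-backed dataset: 25 rows, 5 features."""

    def __init__(self, n_rows=25, n_features=5):
        values = torch.arange(n_rows * n_features, dtype=torch.float32)
        self.X = values.reshape(n_rows, n_features)
        self.y = torch.arange(n_rows) % 3  # three made-up class labels

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

loader = DataLoader(ToyDataset(), batch_size=10, shuffle=False)
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
print(batch_shapes)
```

With 25 rows and a batch size of 10, the loader yields two full batches and one final batch of 5.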


In [None]:
l = len(featureset)  # len() calls featureset.__len__()
print(l)

In [None]:
featureset.plotterdata()

# Linear Regression
In linear regression, the domain set $\mathcal{X}$ is a subset of $\mathbb{R}^d$ for some $d$, and the label set $\mathcal{Y}$ is the set of real numbers. The goal is to find a linear function $h:\mathbb{R}^d \to \mathbb{R}$ that best approximates the relationship between the two. We will start with a simple regression model, i.e. $d=1$, which amounts to fitting a straight line to data.

$$y=ax+b$$

Here $y$ is the observed value, $a$ is the slope, $b$ is the intercept, and $x$ is the measured input value.


We will start by loading all the standard libraries. 

In [None]:
import matplotlib.pyplot as plt #For plotting
import numpy as np # For array manipulation

## Toy Example
Consider data pairs $(x,y)$ generated with $a=5$ and $b=-20$, plus Gaussian noise.

In [None]:
x = 10 * np.random.rand(50);#Actual Data
y = 5 * x - 20 +2*np.random.randn(50);#Observed Data
'''
Plot data
'''
plt.scatter(x, y);
plt.xlabel('x');
plt.ylabel('y');

If we pick a point $x$ that is not in the data set, we can use the line with slope $a$ and intercept $b$ to predict the value of $y$.

In machine learning, we often call this line a *model*, since it is what
we use to make our predictions.

We call this model *linear* because it fits the data with a line.

*Regression* is the task of estimating the relationship between correlated variables, here predicting a real-valued $y$ from $x$.

When you have just one input value (feature), linear regression is the problem of fitting a line to the data and using this line to make new predictions. When there is more than one input variable, linear regression is the problem of fitting a hyperplane to the data and using this hyperplane to make new predictions.
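
To make the hyperplane case concrete, here is a small NumPy sketch with two input features. The data, the true weights (3.0, -1.5) and the intercept (4.0) are made up for illustration, and the fit uses `np.linalg.lstsq` as one possible least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))      # two made-up input features
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0    # noiseless plane chosen for illustration

# Append a column of ones so the intercept is estimated as a third weight.
A = np.column_stack([X, np.ones(len(X))])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
print(weights)
```

Because the data are noiseless, the recovered weights should match the chosen values up to floating-point error.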

Here, given the data pairs $(x,y)$, we have to estimate the coefficient $a$ and the intercept $b$. We will use scikit-learn's `LinearRegression` estimator to fit the data.

Here the loss function we are trying to minimize is $L(\hat{a},\hat{b})=\|y-(\hat{a}x+\hat{b})\|^2$.
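
For the single-feature case, this loss has a well-known closed-form minimizer. Writing $\bar{x}$ and $\bar{y}$ for the sample means and $x_i, y_i$ for the individual data points (notation introduced here for the derivation), setting the partial derivatives of $L$ to zero gives:

$$\frac{\partial L}{\partial \hat{b}} = -2\sum_{i}\bigl(y_i-\hat{a}x_i-\hat{b}\bigr)=0 \;\Rightarrow\; \hat{b}=\bar{y}-\hat{a}\,\bar{x}$$

$$\frac{\partial L}{\partial \hat{a}} = -2\sum_{i}x_i\bigl(y_i-\hat{a}x_i-\hat{b}\bigr)=0 \;\Rightarrow\; \hat{a}=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i}(x_i-\bar{x})^2}$$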

In [None]:
from sklearn.linear_model import LinearRegression 
model = LinearRegression(fit_intercept=True) # Specifying the model.
model.fit(x[:, np.newaxis], y); #fit the model
'''
Plot the data
'''
xfit = np.linspace(0, 10, 1000);
yfit = model.predict(xfit[:, np.newaxis]);
plt.scatter(x, y,label='Observed Value');
plt.plot(xfit, yfit,'g',label='Estimate');
plt.xlabel('x');
plt.ylabel('y');
plt.legend();

Here, scikit-learn's `LinearRegression` class uses ordinary least squares (OLS). OLS, also called linear least squares, estimates the parameters in a regression model by minimizing the sum of the squared residuals.
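
As a sanity check on the least-squares claim, the single-feature estimates can be computed directly from sample means and compared with what `LinearRegression` returns. This is a sketch on synthetic data similar to the toy example; the seed and the generating values are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = 10 * rng.random(50)
y = 5 * x - 20 + 2 * rng.standard_normal(50)

# Closed-form OLS estimates for a single feature.
a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - a_hat * x.mean()

model = LinearRegression().fit(x[:, np.newaxis], y)
print(a_hat, model.coef_[0])
print(b_hat, model.intercept_)
```

The two pairs of numbers should agree up to floating-point error.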

# Simple Linear Regression 

In this example, we'll build a linear regression model to predict 'Sales' using the 'TV' marketing budget as the predictor variable.


## Understanding the Data

Let's start with the following steps:

1. Importing data using the pandas library
2. Understanding the structure of the data

In [None]:
import pandas as pd

In [None]:
# Read the CSV file (assumes tvmarketing.csv is in the current working directory)
advertising = pd.read_csv('tvmarketing.csv')

Now, let's check the structure of the advertising dataset.

In [None]:
# Display the first 5 rows
advertising.head()

In [None]:
# Display the last 5 rows
advertising.tail()

In [None]:
# Let's check the columns
advertising.info()

In [None]:
# Check the shape of the DataFrame (rows, columns)
advertising.shape

In [None]:
# Let's look at some statistical information about the dataframe.
advertising.describe()
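
The inspection calls above can be tried without the CSV file. The sketch below builds a tiny DataFrame with made-up values (not taken from `tvmarketing.csv`) and shows what `head`, `shape`, and `describe` return:

```python
import pandas as pd

# Made-up values for illustration only; they are not taken from tvmarketing.csv.
toy = pd.DataFrame(
    {"TV": [100.0, 50.0, 20.0, 150.0], "Sales": [12.0, 9.0, 7.0, 15.0]}
)

print(toy.head(2))          # first two rows
print(toy.shape)            # (rows, columns)
summary = toy.describe()    # count, mean, std, min, quartiles, max per column
print(summary.loc["mean"])
```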

# Visualising Data Using Plot

In [None]:
# Visualise the relationship between the features and the response using scatterplots
advertising.plot(x='TV',y='Sales',kind='scatter')

# Performing Simple Linear Regression

Equation of linear regression<br>
$y = c + m_1x_1 + m_2x_2 + ... + m_nx_n$

-  $y$ is the response
-  $c$ is the intercept
-  $m_1$ is the coefficient for the first feature
-  $m_n$ is the coefficient for the nth feature<br>

In our case:

$y = c + m_1 \times TV$

The $m$ values are called the model **coefficients** or **model parameters**.

### Generic Steps in Model Building using ```sklearn```

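Before you read further, it is good to understand the generic structure of modeling using the scikit-learn library. Broadly, the steps to build a model, covered in the sections below, are: preparing X and y, splitting the data into training and test sets, fitting the model, inspecting the coefficients, and making predictions.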

## Preparing X and y

-  The scikit-learn library expects X (feature variable) and y (response variable) to be NumPy arrays.
-  However, X can be a dataframe as Pandas is built over NumPy.
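
One detail worth seeing in code is the difference between selecting a column as a Series and as a one-column DataFrame, since scikit-learn estimators expect X to be 2-D. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"TV": [100.0, 50.0, 20.0], "Sales": [12.0, 9.0, 7.0]})

as_series = toy["TV"]       # single brackets: 1-D Series
as_frame = toy[["TV"]]      # double brackets: 2-D DataFrame with one column

print(as_series.shape, as_frame.shape)
print(as_frame.to_numpy().shape)
```

The one-column DataFrame already has the (rows, 1) shape, so it can be passed to an estimator without reshaping.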

In [None]:
# Putting feature variable to X
X = advertising['TV']
# Putting response variable to y
y = advertising['Sales']

from sklearn.model_selection import train_test_split
# random_state is the seed used by the random number generator; it can be any integer.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
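
The effect of `random_state` is easiest to see on a tiny array: the same seed produces the same split every time. A sketch with ten made-up samples (the seed value 0 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)
y = X * 2

# Same seed -> identical split; the seed value 0 is arbitrary.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X, y, train_size=0.7, random_state=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X, y, train_size=0.7, random_state=0)

print(len(X_tr_a), len(X_te_a))
print(np.array_equal(X_tr_a, X_tr_b))
```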

In [None]:
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))

In [None]:
# By scikit-learn convention, observations are rows and features are columns, so X must be 2-D.
# The reshaping below is needed only when you are using a single feature; in this case, 'TV'.

import numpy as np
# np.newaxis adds one more dimension to an existing array. Recent pandas versions no longer
# support Series[:, np.newaxis], so convert each Series to a NumPy array first.
X_train = X_train.to_numpy()[:, np.newaxis]
X_test = X_test.to_numpy()[:, np.newaxis]
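
The shape change is easy to check on a small NumPy array. The sketch below uses made-up feature values and also shows `reshape(-1, 1)`, an equivalent way to get a single column:

```python
import numpy as np

x = np.array([230.1, 44.5, 17.2])   # made-up 1-D feature values

col_a = x[:, np.newaxis]            # shape (3,) -> (3, 1)
col_b = x.reshape(-1, 1)            # equivalent spelling

print(x.shape, col_a.shape, col_b.shape)
```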

## Performing Linear Regression

In [None]:
# import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

# Create a LinearRegression object named lr
lr = LinearRegression()

# Fit the model using lr.fit()
lr.fit(X_train, y_train)

## Coefficients Calculation

In [None]:
# Print the intercept and coefficients
print(lr.intercept_)
print(lr.coef_)

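Plugging the printed intercept and coefficient into the equation gives the fitted line; the exact values depend on the train/test split. For an intercept of 6.989 and a coefficient of 0.0464:

$y = 6.989 + 0.0464 \times TV$<br>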

Now, let's use this equation to predict our sales.
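
Before calling `predict`, it can help to apply the equation by hand. The sketch below plugs a few hypothetical TV budgets into the line quoted above; the budgets are made up and the units are whatever the dataset uses:

```python
import numpy as np

intercept, slope = 6.989, 0.0464             # values quoted in the equation above
tv_budgets = np.array([0.0, 100.0, 200.0])   # hypothetical TV budgets

predicted_sales = intercept + slope * tv_budgets
print(predicted_sales)
```

A fitted model's `predict` method performs the same arithmetic with its own `intercept_` and `coef_`.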

## Predictions

In [None]:
# Making predictions on the testing set
y_pred = lr.predict(X_test)

In [None]:
type(y_pred)



