# Statistics for Data Science [DS401]
## Simple Linear Regression. Knowing the data
#### By: Javier Orduz

[licenseBDG]: https://img.shields.io/badge/License-CC-orange?style=plastic
[license]: https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en

[mywebsiteBDG]:https://img.shields.io/badge/website-jaorduz.github.io-0abeeb?style=plastic
[mywebsite]: https://jaorduz.github.io/

[mygithubBDG-jaorduz]: https://img.shields.io/badge/jaorduz-repos-blue?logo=github&label=jaorduz&style=plastic
[mygithub-jaorduz]: https://github.com/jaorduz/

[mygithubBDG-jaorduc]: https://img.shields.io/badge/jaorduc-repos-blue?logo=github&label=jaorduc&style=plastic 
[mygithub-jaorduc]: https://github.com/jaorduc/

[myXprofileBDG]: https://img.shields.io/static/v1?label=Follow&message=jaorduc&color=2ea44f&style=plastic&logo=X&logoColor=black
[myXprofile]:https://twitter.com/jaorduc


[![website - jaorduz.github.io][mywebsiteBDG]][mywebsite]
[![Github][mygithubBDG-jaorduz]][mygithub-jaorduz]
[![Github][mygithubBDG-jaorduc]][mygithub-jaorduc]
[![Follow @jaorduc][myXprofileBDG]][myXprofile]
[![CC License][licenseBDG]][license]

<h1>Contents</h1>

<div class="alert  alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#unData">Simple Linear Regression. Knowing Data</a></li>
         <!-- <ol>
             <li><a href="#reData">Reading</a></li>
             <li><a href="#exData">Exploration</a></li>
         </ol> -->
        <li><a href="#daExploration">Querying</a></li>
        <li><a href="#exercises">Exercises</a></li>
        <li><a href="#versions">Versions</a></li>        
        <li><a href="#references">References</a></li>
    </ol>
</div>
<br>
<hr>

<!-- ### <font color='blue'> Linear Regression </font> -->

In [None]:
import matplotlib.pyplot as plt  # plotting
import numpy as np               # numerical arrays
import pandas as pd              # tabular data (DataFrame, Series)
%matplotlib inline

## Simple Linear Regression. Knowing data

The data used in this notebook come from [0].
Linear regression is a foundational technique in statistics and data science: it is useful in its own right and serves as a baseline for more sophisticated methods. Much of machine learning builds on linear regression and its variants, so this notebook introduces how a linear regression is formulated and fitted. After working through this material, you will be able to:
1. Fit appropriate models to the data sets you encounter.
1. Experiment with different linear regression variants and observe their effect.
1. Understand the foundations on which regression models are built.

These skills are the basis for applying linear regression across a wide range of data-driven problems.

### Linear Regression

We first examine a toy problem, focusing our efforts on fitting a linear model to a small dataset with three observations.  Each observation consists of one predictor $x_i$ and one response $y_i$ for $i = 1, 2, 3$,

\begin{align*}
\big(x , y\big) = \big\{(x_1, y_1), (x_2, y_2), (x_3, y_3)\big\}.
\end{align*}

To be very concrete, let's set the values of the predictors and responses.

\begin{equation*}
\big(x , y\big) = \big\{(1, 2), (2, 3), (3, 6)\big\}
\end{equation*}

There is no line of the form $$\beta_0 + \beta_1 x = y$$ that passes through all three observations, since the data are not collinear. 
<!--
Thus our aim is to find the line that best fits these observations in the *least-squares sense*, as discussed in lecture.
-->

#### <font color='blue'> Example: Linear Regression </font>


In [None]:
x_train = np.array([1,2,3])
y_train = np.array([2,3,6])
type(x_train)

In [None]:
x_train.shape

In [None]:
x_train = x_train.reshape(-1, 1)  # column vector: one row per observation, one feature

In [None]:
x_train.shape

In [None]:
plt.scatter(x_train, y_train)
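
The least-squares line for the three points stored in `x_train` and `y_train` above can be computed directly. The following is a minimal sketch, assuming the standard closed-form estimates for simple linear regression; the names `beta0`, `beta1`, and `residuals` are chosen here for illustration, and `np.polyfit` is used only as a cross-check.

```python
import numpy as np

# Toy data, same values as x_train and y_train in the example above
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 6.0])

# Closed-form least-squares estimates for y ≈ beta0 + beta1 * x
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# Cross-check: a degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are the vertical distances between the data and the fitted line
residuals = y - (beta0 + beta1 * x)

print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
print("residuals:", np.round(residuals, 4))
```

For these points the slope is 2 and the intercept is -1/3. The residuals are not all zero, which confirms that no single line passes through all three observations.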

<h3>DataFrame</h3>

<div class="alert  alert-block alert-info" style="margin-top: 20px">
    A DataFrame represents a rectangular table of data and contains an ordered collection 
    of columns, each of which can be a different value type (numeric, string, boolean, etc.). 
    The DataFrame has both a row and column index; it can be thought of as a dict 
    of Series all sharing the same index. Under the hood, the data is stored as one or 
    more two-dimensional blocks rather than a list, dict, or some other collection of 
    one-dimensional arrays.
</div>
<br>
<hr>
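
The "dict of Series sharing the same index" description can be illustrated with a small, made-up table. This is only a sketch: the row labels, column names, and values below are hypothetical and are not taken from `FuelConsumption.csv`.

```python
import pandas as pd

# Hypothetical values, for illustration only
index = ["car_a", "car_b", "car_c"]
engine = pd.Series([1.6, 2.0, 3.5], index=index)        # numeric column
fuel = pd.Series(["X", "Z", "Z"], index=index)          # string column
hybrid = pd.Series([False, False, True], index=index)   # boolean column

# Build a DataFrame from a dict of Series that share the same index
toy_df = pd.DataFrame(
    {"engine_size": engine, "fuel_type": fuel, "is_hybrid": hybrid}
)

print(toy_df.dtypes)                 # each column keeps its own value type
print(type(toy_df["engine_size"]))   # a single column is a Series
print(toy_df["engine_size"].index.equals(toy_df.index))  # shared row index
```

Each column keeps its own value type, while every column is aligned on the same row index.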

In [None]:
df = pd.read_csv("../../data/FuelConsumption.csv")

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
print("Number of rows =", df.shape[0], "\nNumber of features (columns) =",df.shape[1])

In [None]:
df.columns

In [None]:
type(df)

In [None]:
df.head()

In [None]:
df.head(7)

In [None]:
df.describe()

## Querying

Note that pandas treats a table (DataFrame) as a collection of Series placed side by side, one per column, all sharing the same row index.

In [None]:
type(df.MODELYEAR), type(df)

In [None]:
df.ENGINESIZE <= 2

In [None]:
SumEng = np.sum(df.ENGINESIZE <= 2)
SumEng

In [None]:
SumEngTotal = np.sum(df.ENGINESIZE <= 2)/df.shape[0]
SumEngTotal

In [None]:
MeanTotal = np.mean(df.ENGINESIZE <= 2.0)
MeanTotal

In [None]:
EngMean = (df.ENGINESIZE <= 2).mean()
EngMean

In [None]:
AverageEng = np.average(df.ENGINESIZE <= 2.0)
AverageEng
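
A comparison such as `df.ENGINESIZE <= 2` returns a boolean Series, and that Series can also be used to select rows. The sketch below assumes a small made-up table: only the column name `ENGINESIZE` mirrors the dataset above, while the `MODEL` values and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column name ENGINESIZE mirrors the dataset above
cars = pd.DataFrame({
    "MODEL": ["m1", "m2", "m3", "m4"],
    "ENGINESIZE": [1.5, 2.0, 2.4, 3.5],
})

mask = cars["ENGINESIZE"] <= 2           # boolean Series, one entry per row
small = cars[mask]                       # keep only the rows where mask is True
small_models = cars.loc[mask, "MODEL"]   # filter rows and pick a column in one step

print(small)
print(small_models.tolist())
```

Using `.loc` with a mask and a column label is a common way to combine row filtering and column selection in a single expression.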

## Exercises

1. Why are the previous outputs the same?
1. Load another dataset and repeat the commands in this notebook, adapting them to the new variables.
1. Use at least four features and calculate the average, mean, median, and sum; then apply at least three more statistical functions. Check the ```numpy``` and ```pandas``` documentation.
<!-- 1. Submmit your report in Moodle. Template https://www.overleaf.com/read/xqcnnnrsspcp -->

## Versions

In [None]:
from platform import python_version
print("python version: ", python_version())

## References

[0] data https://tinyurl.com/2m3vr2xp

[1] numpy https://numpy.org/

[2] scipy https://docs.scipy.org/

[3] matplotlib https://matplotlib.org/

[4] matplotlib.cm https://matplotlib.org/stable/api/cm_api.html

[5] matplotlib.pyplot https://matplotlib.org/stable/api/pyplot_summary.html

[6] pandas https://pandas.pydata.org/docs/

[7] seaborn https://seaborn.pydata.org/
