# Statistical Thinking 

🏁 Welcome! This Jupyter Notebook is all about statistics and how to think probabilistically.

## 📚 Libraries

In [2]:
import numpy as np 
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt 
import seaborn as sns


my_colors = ["#ce8f5a", "#efd199", "#80c8bc", "#5ec0ca", "#6287a2"]
sns.palplot(sns.color_palette(my_colors))

# Set Style

sns.set_style("white")
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False

class color:
    BOLD = '\033[1m' + '\033[93m'
    END = '\033[0m'


# What is the purpose of this notebook? 🤔

<p>🟢 The main purpose is to discuss and refresh material introduced in Stanford's Introduction to Statistics course, and to highlight basic but easily overlooked points about why statistics belongs in the toolkit of data-related fields and almost any other real-world job.</p>



## 1. What is data and why is it important?

Before discussing what statistics is, it is worth introducing data. I think of statistics as the tools and data as the material. We need to know the types of data and how to shape them so that we can work on them with the right "tools".

<b>Data</b> is defined as distinct pieces of information and it can come in many forms. From numbers in a spreadsheet, text to video and databases, to images and audio recordings, utilizing data in its different forms is the new way of the world.

Data is used to understand and improve nearly every facet of our lives. So, no matter what field you are in, you can utilize data to make better decisions and accomplish your goals.

So, if a business wants to evolve, expand, or fix issues in its systems, it needs data about the thing it wants to change.

### <b>If you can’t measure it, you can’t improve it</b>


 <img src="https://static.vecteezy.com/system/resources/previews/000/650/182/original/thinking-about-statistics-vector.jpg" alt="Drawing" style="width: 400px;"/>

## 2. Elements of Structured Data



<b>Data</b> comes from many sources: sensor measurements, events, images, and videos. Much of this data is unstructured: images are collections of pixels, with each pixel containing RGB color information. Clickstreams are sequences of actions by a user interacting with an app or web page. In fact, a major challenge of data science is to turn this torrent of raw data into actionable information. To apply statistical concepts, unstructured raw data must be processed and manipulated into a structured form. One of the most common forms of structured data is a table with rows and columns.
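As a rough sketch of what "processing raw data into a structured form" can look like, the snippet below turns a few clickstream-style events into a table. The events, field names, and values are all invented for illustration.

```python
import pandas as pd

# Hypothetical raw clickstream events, invented for illustration.
raw_events = [
    {"user": "u1", "action": "view", "page": "/home"},
    {"user": "u1", "action": "click", "page": "/pricing"},
    {"user": "u2", "action": "view", "page": "/home"},
]

# One row per event, one column per field: a structured table.
events = pd.DataFrame(raw_events)

# Once the data is tabular, simple summaries become one-liners.
actions_per_user = events.groupby("user")["action"].count()

print(events)
print(actions_per_user)
```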

<img src = "https://lawtomated.com/wp-content/uploads/2019/04/structuredVsUnstructuredIgneos.png" alt="Drawing" style="width: 900px;" />

## 3. Types of Structured Data

*There are two basic types of structured data: numeric and categorical. Numeric data comes in two forms: continuous, such as wind speed or time duration, and discrete, such as the count of occurrences of an event. Categorical data takes only a fixed set of values, such as a type of TV screen or a state name. Binary data is an important special case of categorical data that takes only one of two values, such as 0/1 or yes/no. Another useful type of categorical data is ordinal data, in which the categories are ordered; an example is a numerical rating (1, 2, 3, 4, or 5).*
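A minimal sketch of how these types might be represented in pandas. The column names and values are made up for illustration, and this is just one reasonable way to encode them.

```python
import pandas as pd

# Hypothetical toy data: every column name and value is invented for illustration.
toy = pd.DataFrame({
    "wind_speed": [3.2, 7.8, 5.1],          # numeric, continuous
    "n_events": [0, 4, 2],                  # numeric, discrete (a count)
    "screen_type": ["LCD", "OLED", "LCD"],  # categorical
    "is_on_sale": [True, False, True],      # binary
    "rating": [5, 3, 4],                    # ordinal (plain integers for now)
})

# Mark the unordered categorical column explicitly.
toy["screen_type"] = toy["screen_type"].astype("category")

# Mark the rating as ordinal: a categorical with a defined order.
toy["rating"] = pd.Categorical(toy["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

print(toy.dtypes)
print(toy["rating"].min(), "<=", toy["rating"].max())
```

Declaring the order matters: comparisons such as `min` and `max` are only meaningful for ordinal categories, not for unordered ones like the screen type.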

<img src = "http://intellspot.com/wp-content/uploads/2018/08/Types-of-Data-Infographic.png" alt="Drawing" style="width: 900px;" />

## 4. Why do we need statistics? 🧐

*We can start with its official definition to get a hint:*

> Statistics is a method of interpreting, analysing and summarising the data. Hence, the types of statistics are categorised based on these features: Descriptive and inferential statistics. Based on the representation of data such as using pie charts, bar graphs, or tables, we analyse and interpret it.

> In terms of mathematical analysis, the statistics include linear algebra, stochastic study, differential equation and measure-theoretic probability theory.


<img src = "https://image.slideserve.com/274561/two-types-of-statistics-l.jpg" alt="Drawing" style="width: 900px;" />

## 5. Descriptive Statistics 

**Summarize and organize the characteristics of a data set**

There are 3 main types of descriptive statistics:

* The distribution concerns the frequency of each value.
* The central tendency concerns the averages of the values.
* The variability or dispersion concerns how spread out the values are.


#### Frequency distribution 
A data set is made up of a distribution of values, or scores. In tables or graphs, you can summarize the frequency of every possible value of a variable in numbers or percentages. 
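A small sketch of a frequency table, assuming a made-up series of ratings (the values are illustrative only):

```python
import pandas as pd

# Hypothetical ratings, invented for illustration.
ratings = pd.Series([5, 3, 4, 5, 5, 2, 4, 5, 3, 4])

# Absolute frequencies: how many times each value occurs.
counts = ratings.value_counts().sort_index()

# Relative frequencies, expressed in percent.
percents = ratings.value_counts(normalize=True).sort_index() * 100

freq_table = pd.DataFrame({"count": counts, "percent": percents})
print(freq_table)
```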

#### Estimates of Location 
Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a "typical value" for each feature (variable):
an estimate of where most of the data is located (i.e., its central tendency).

**Mean**: The sum of all values divided by the number of values.  
*Synonym*: "Average"

**Median**: The value such that one-half of the data lies above it and one-half below.

**Percentile**: The value such that *P* percent of the data lies below.  
*Synonym*: "quantile"

There are two more important definitions we need to know:

**Robust**: Not sensitive to extreme values.

**Outlier**: A data value that is very different from most of the data.
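To see what "robust" means in practice, here is a small sketch with invented numbers: adding a single extreme value moves the mean a lot but barely moves the median.

```python
import numpy as np

# Hypothetical values, invented for illustration.
values = np.array([48.0, 50.0, 51.0, 52.0, 49.0])
with_outlier = np.append(values, 500.0)  # add one extreme value

print("mean:  ", np.mean(values), "->", np.mean(with_outlier))
print("median:", np.median(values), "->", np.median(with_outlier))
print("25th and 75th percentiles:", np.percentile(values, [25, 75]))
```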


![](https://media.giphy.com/media/9ADoZQgs0tyww/giphy.gif)

In [3]:
df = pd.read_csv("../input/inputdatasets/2008_all_states.csv")
df.head()
# This dataset will be used to illustrate the points discussed below.
# It contains county-level results from the 2008 US presidential election.

In [4]:
# Summarize the Democratic vote share (dem_share) with a single number

tot_mean = df.dem_share.mean()

tot_median = df.dem_share.median()

print(tot_mean)
print(tot_median)

# The mean is sensitive to outliers, while the median is robust.
# If the two values are close, that hints at a roughly symmetric
# distribution with few extreme outliers.

So with simple descriptive statistics such as the mean and median, we can get a first look at our dataset.
But single-number summaries like these are not enough to be sure of our conclusions.
Another useful approach is to visualize the distribution of the data to see the whole picture.

**If you can appropriately display your data, you can already start to draw Conclusions from it.**

## 6. Exploratory Data Analysis (EDA)

Classically, statistics has been focused entirely on inference: drawing conclusions about large populations using small samples. John W. Tukey wrote a seminal paper in 1962 called “The Future of Data Analysis” and proposed a new scientific discipline called data analysis, which included statistical inference as one component but also considered engineering and computer science.

The field of exploratory data analysis was established with Tukey’s now-classic 1977 book “Exploratory Data Analysis”.

### Exploring the data
The process of organizing, plotting, and summarizing a dataset.

**We should explore the data first.**
This involves taking data from tabular form and representing it graphically.

*We are representing the same information, but it is in a more human-interpretable form*

## 7. Graphical exploratory data analysis 

Exploring the data is the process of organizing, plotting, and summarizing a dataset.
We should explore our data first; this involves taking data from tabular form and representing it graphically.
We are representing the same information, but in a more human-interpretable form. For example, we can take the Democratic share of the vote and plot it as a histogram. Just by making one plot, we can already draw a conclusion about the data.

> “It is best to use a graphical summary to communicate information, because people prefer to look at pictures rather than at numbers.”


<img src = "https://media.giphy.com/media/TJP7EH5i1fB2rKeWbf/giphy.gif" alt="Drawing" style="width: 300px;" />

### 7.1 Histogram

In [5]:
# For example, take the Democratic share of the vote
# and plot it as a histogram
_ = sns.histplot(df['dem_share'], stat='percent', bins=20)
_ = plt.xlabel("Percent of vote for Obama")
_ = plt.ylabel("Percent of counties")


**A major drawback of histograms** is that the same dataset can look different depending on how the **bins** are chosen, and the choice of bins is in many ways arbitrary.

**Binning bias**: the same data may be interpreted differently depending on the choice of bins.

An additional problem with histograms is that we are not plotting all of the data: we sweep the data into bins and lose the actual values.
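Binning bias is easy to demonstrate numerically. The sketch below bins the same synthetic data three different ways; the data are random draws invented for illustration, and the square-root rule at the end is just one common rule of thumb, not the only option.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Hypothetical data: 200 draws from a normal distribution, for illustration only.
data = rng.normal(loc=50, scale=10, size=200)

# The same data, binned three different ways.
for n_bins in (5, 20, 50):
    counts, edges = np.histogram(data, bins=n_bins)
    print(n_bins, "bins -> tallest bin holds", counts.max(), "points")

# A common rule of thumb: number of bins = square root of the number of points.
n_bins_sqrt = int(np.sqrt(len(data)))
print("square-root rule suggests", n_bins_sqrt, "bins")
```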

### 7.2 Bee Swarm
*The position along the y-axis is the quantitative information; the data are spread out along x to make every point visible.*
Notably, we no longer have any binning bias and all data are displayed.
**Note**: A requirement is that your data are in a well-organized pandas DataFrame,
where each column is a feature and each row is an observation.

In [6]:
# Keep only the swing states
df_2 = df[df.state.isin(['PA','OH', 'FL'])]
df_2.head()

In [7]:
_ = sns.swarmplot(x='state', y='dem_share', data=df_2)
_ = plt.xlabel("State")
_ = plt.ylabel("Percent of vote for Obama")

### 7.3 Empirical cumulative distribution functions (ECDFs)

**Why the bee swarm is not always the best option**
> We saw the clarity of bee swarm plots. However, there is a limit to their efficacy. For example, imagine we wanted to plot the county-level voting data for all states. *The bee swarm has a real problem: the points at the edges overlap, which is necessary in order to fit all of them onto the plot.*

**ECDF**: The x-axis is the quantity you are measuring. The y-axis is the fraction of data points that have a value less than or equal to the corresponding x-value.
*We can also easily plot multiple ECDFs on the same axes.*

**Making an ECDF** ==> The x-values are the sorted data, *which we generate using the NumPy function "sort"*: x = np.sort(data)

==> The y-values are evenly spaced points with a maximum of one,

*which we can generate using* => y = np.arange(1, len(x)+1) / len(x)

**As this is a repeatable process, we can write a function for it.**

In [8]:
def ecdf(data):
    """Compute ECDF for a 1-D array of measurements."""
    
    # number of data points: n 
    n = len(data)
    
    # x-data for the ECDF: x
    x = np.sort(data)
    
    # y-data for the ECDF: y
    y = np.arange(1, len(x)+1) / n
    
    return x, y

In [9]:
x, y = ecdf(df['dem_share'])

plt.plot(x,y, marker='.', linestyle = 'none')

plt.xlabel("Percent of vote for Obama")

plt.ylabel("ECDF")

# keep data off plot edges
plt.margins(0.02)

## 8. Quantitative Exploratory Data Analysis

> We often would like to summarize data even more succinctly, say in one or two numbers. 

These numerical summaries are **not by any stretch** a substitute for the graphical methods, but they do take up less real estate.

In [10]:
z = df_2[df_2.state == 'PA']
np.mean(z.dem_share) # this computes the mean for a column.

We might like a summary statistic that is immune to extreme data.
**The median** provides exactly that: the middle value of a dataset.

In [11]:
np.median(z.dem_share)

## 8.1 Percentiles, outliers, and boxplots
*The median is the special name for the 50th percentile; that is, 50% of the data are less than the median.*

Similarly, the 25th percentile is the value that is greater than 25% of the sorted data, and so on for any other percentile we want.

**Percentiles** are useful summary statistics and can be computed with --> np.percentile()
*It takes the column data as its first argument and a list of the percentiles we want as its second argument (percentiles, NOT fractions: pass 25, not 0.25).*

**Now we have three summary statistics.** The whole point of summary statistics was to keep things concise, but we're starting to get a lot of numbers here.

**Dealing with this issue is where "Quantitative" EDA meets "Graphical" EDA**

**Box plots** --> were invented to display some of the salient features of a dataset based on percentiles.

==> The center line of the box is the median (50th percentile), and the edges of the box are the 25th and 75th percentiles. The total height of the box contains the middle 50% of the data and is called the interquartile range (IQR). The whiskers extend a distance of 1.5 * IQR from the box, or to the extent of the data, whichever is less extreme. Points beyond the whiskers are plotted individually as outliers.
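The anatomy of a box plot can be reproduced by hand. This is a minimal sketch using invented values (with one deliberate outlier) and the 1.5 * IQR whisker rule described above; plotting libraries may differ in small details such as how percentiles are interpolated.

```python
import numpy as np

# Hypothetical values, invented for illustration; 95.0 is a deliberate outlier.
data = np.array([38.0, 41.0, 43.0, 45.0, 47.0, 50.0, 52.0, 55.0, 58.0, 95.0])

# Box: 25th, 50th, and 75th percentiles.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme data points that lie within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("box:", q1, median, q3, "IQR:", iqr)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```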

<img src="https://miro.medium.com/max/8000/1*0MPDTLn8KoLApoFvI0P2vQ.png" alt="Drawing" style="width: 900px;" />

In [12]:
df.head()

In [13]:
percentiles = np.array([2.5, 25, 50, 75, 97.5])
                        
voters_percentiles = np.percentile(df.dem_share, percentiles)
print(voters_percentiles)

In [14]:
_ = sns.boxplot(x='east_west', y='dem_share', data=df)

_ = plt.xlabel("region")
_ = plt.ylabel("Percent of vote for Obama")

## 8.2 Variance and standard deviation 

**What about the variability, or the spread of the data ?**

**Variance** --> the mean (average) squared distance of the data from their mean. Informally, it is a measure of the spread of the data.

For each data point, we square the distance from the mean, and then take the average of all of these values.

This is calculated with --> np.var()

<b>Now, because the calculation of the variance involves squared quantities, it does not have the same units as what we measured. Therefore, we are interested in the square root of the variance.</b>

<b>Standard deviation</b> --> the square root of the variance.

**A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.**

This is calculated with --> np.std()

*It is a reasonable metric for the typical spread of the data.*
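Written out as formulas, with notation that is assumed here rather than taken from the notebook ($x_i$ for the data points, $n$ for their number, $\bar{x}$ for their mean), the two definitions read:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\sigma = \sqrt{\sigma^2}
$$

This is the version that divides by $n$, which is what np.var() and np.std() compute by default (ddof=0).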

In [15]:
# Array of differences to mean: differences
differences = df.dem_share - np.mean(df.dem_share)


# Square the differences: diff_sq
diff_sq = differences ** 2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np

variance_np = np.var(df.dem_share)
# Print the results
print(variance_explicit , variance_np)

In [16]:
# Compute the variance: variance
variance = np.var(df.dem_share)

# Print the square root of the variance
print(np.sqrt(variance))

# Print the standard deviation
print(np.std(df.dem_share))

## 8.3 Covariance and the Pearson correlation coefficient

*We would like to have a summary statistic to go along with the information we glean from the scatter plot below.*

**Covariance** --> A measure of how two quantities vary together.

*np.cov()*

To understand where it comes from, imagine annotating the scatter plot with the means of the two quantities we're interested in. Each data point then has a distance from the mean of (x) and a distance from the mean of (y).

The covariance is the mean of the product of these differences.

**If (x) and (y) both tend to be above, or both below, their respective means together,** as they are in this dataset, then the covariance is positive.

*This means that they are positively correlated: when (x) is high, so is (y).*

Conversely, if (x) is high while (y) is low, the covariance is negative.

**If we want a more generally applicable measure of how two variables depend on each other, we want it to be dimensionless, that is, to not have any units.**

**Pearson correlation coefficient** --> a comparison of the variability in the data due to covariance to the variability inherent to each variable independently (their standard deviations). *It ranges from (-1) to (1).*

np.corrcoef()

Equivalently, the Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.
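Both definitions can be checked by hand against NumPy. The sketch below uses a few invented paired values; note that np.cov() divides by n - 1 by default, so ddof=0 is passed to match the "mean of the products" definition given here.

```python
import numpy as np

# Hypothetical paired measurements, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance: mean of the products of the deviations from each mean.
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Pearson r: covariance divided by the product of the standard deviations.
r_manual = cov_xy / (np.std(x) * np.std(y))

# NumPy equivalents; ddof=0 makes np.cov divide by n instead of n - 1.
cov_np = np.cov(x, y, ddof=0)[0, 1]
r_np = np.corrcoef(x, y)[0, 1]

print(cov_xy, cov_np)
print(r_manual, r_np)
```

The correlation coefficient does not depend on the choice of ddof, because the same normalization appears in the numerator and the denominator and cancels.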

<img src="https://usersolutions.com/wp-content/uploads/2014/03/Covariance.png" alt="Drawing" style="width: 350px;" />

![image.png](attachment:80af24e3-34da-43a4-b4ec-fe1515816123.png)

In [17]:
_ = sns.scatterplot(x='total_votes', y='dem_share', data=df_2)
_ = plt.xlabel("Total votes")
_ = plt.ylabel("Percent of vote for Obama")

In [18]:
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x,y)

    # Return entry [0,1]
    return corr_mat[0,1]

In [19]:
df_2.head()

In [20]:
_ = sns.scatterplot(x='dem_votes', y='rep_votes', data=df_2)
_ = plt.xlabel("Democratic votes")
_ = plt.ylabel("Republican votes")

In [21]:
# Compute Pearson correlation coefficient 
r = pearson_r(df_2.dem_votes , df_2.rep_votes)

# Print the result
print(r)

## 9. Probabilistic logic and statistical inference
*Probabilistic reasoning allows us to describe uncertainty.*

For example, though you can't tell me exactly what the mean of the next 50 petal lengths you measure will be, you could say, "It is more likely to be close to what I got in the first 50 measurements than it is to be much greater."

<b>That is what probabilistic thinking is all about --> given a set of data, you describe probabilistically what you might expect if those data were acquired again and again.</b>

<b>This is the heart of statistical inference.</b>

<b>*It is the process by which we go from measured data to probabilistic conclusions about what we might expect if we collected the same data again.*</b>

"Your data speak in the language of probability."

<img src="https://media.giphy.com/media/TWFrH2pBDUCnzwKsQz/giphy.gif" alt="Drawing" style="width: 350px;" />

## 9.1 Discrete variables (random number generators and hacker statistics)

In practice, we're going to think probabilistically using hacker statistics.

**Hacker statistics --> uses simulated repeated measurements to compute probabilities**

*np.random* --> a module with a suite of functions based on random number generation.

*np.random.random()* --> draws a number between 0 (inclusive) and 1 (exclusive).

**Bernoulli trial --> *an experiment that has two options, 'success' (True) and 'failure' (False)***

*np.random.seed()* --> takes an integer that is fed into the random number generating algorithm. Manually seed the random number generator if you need reproducibility.

**Hacker statistics probabilities --> determine how to simulate the data, then simulate it many, many times. The probability is approximately the fraction of trials with the outcome of interest.**

In [22]:
## Example: coin flips (a value below 0.5 counts as heads)
np.random.random(size=4)

In [23]:
## We want to know the probability of getting four heads if we were to repeat the four flips over and over again (using a for loop).

n_all_heads = 0 # we first initialize the counter to zero 

for _ in range(10000):
    
    heads = np.random.random(size=4) < 0.5
    n_heads = np.sum(heads)
    
    if n_heads == 4:
        n_all_heads += 1
        
n_all_heads / 10000  

## 9.2 Probability distributions and stories: the binomial distribution

**Probability mass function (PMF) --> *the set of probabilities of discrete outcomes.***

To understand how this works, consider a person rolling a die once.

The outcomes are discrete because only certain values can be attained. Each result has the same, or uniform, probability (1/6).

The PMF associated with this story is called --> the discrete uniform PMF.

The PMF is a property of a discrete probability distribution.

**Probability distribution --> a mathematical description of outcomes.**

**Discrete uniform distribution: the story --> "the outcome of rolling a single fair die is discrete uniformly distributed."**

*outcome* == DISCRETE, *rolling a fair die* == UNIFORMLY DISTRIBUTED
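The die-rolling story can be checked with a quick simulation. This sketch uses NumPy's newer Generator interface (np.random.default_rng) rather than the legacy np.random functions used elsewhere in this notebook; the number of rolls and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

# Simulate 60,000 rolls of a fair six-sided die (integers 1 through 6;
# the upper bound of rng.integers is exclusive).
rolls = rng.integers(low=1, high=7, size=60_000)

# Estimate the PMF as the fraction of rolls showing each face.
faces, counts = np.unique(rolls, return_counts=True)
pmf_estimate = counts / len(rolls)

for face, prob in zip(faces, pmf_estimate):
    print(face, round(prob, 3))
```

Each estimated probability should land close to 1/6, and the estimates get tighter as the number of rolls grows.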

**Binomial distribution: the story --> "the number (r) of successes in (n) Bernoulli trials with probability (p) of success is binomially distributed."**

To sample from it ==> *np.random.binomial(number of trials, probability of success, size=10)*

The ECDF is just as informative as the PMF and is easier to plot.

*The (size) keyword argument tells the function how many random numbers to sample out of the binomial distribution.*
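As a quick sanity check on binomial sampling, the sketch below draws many samples and compares their mean and standard deviation with the textbook values n * p and sqrt(n * p * (1 - p)). The parameter values are illustrative, and the Generator interface (np.random.default_rng) is used here in place of the legacy np.random.binomial; it takes the same arguments.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

n_trials, p_success = 100, 0.05  # illustrative values

# size tells the sampler how many binomial draws to return.
samples = rng.binomial(n_trials, p_success, size=10_000)

theory_mean = n_trials * p_success
theory_std = np.sqrt(n_trials * p_success * (1 - p_success))

print("sample mean:", samples.mean(), "theory:", theory_mean)
print("sample std: ", samples.std(), "theory:", theory_std)
```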

In [24]:
def perform_bernolli_trails(n, p):
    """Perform (n) Bernoulli trials with success probability (p)
       and return the number of successes."""
    
    # Initialize number of successes: n_success
    n_success = 0 
    
    # Perform trials
    for i in range(n):
        # Choose a random number between zero and one: random_number
        random_number = np.random.random()
        
        # If less than (p), it is a success, so add one to n_success
        if random_number < p:
            n_success += 1
            
    return n_success

In [25]:
# Seed random number generator 
np.random.seed(42)

#Initialize the number of defaults: n_defaults 
n_defaults = np.empty(1000)

# compute the number of defaults 
for i in range(1000):
    n_defaults[i] = perform_bernolli_trails(100, 0.05)
    
# Plot the histogram with the default number of bins
# (without normalization, the y-axis shows counts, not probabilities)

_ = plt.hist(n_defaults)
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('number of simulations')

# Show the plot
plt.show()

In [26]:
# Compute ECDF: x, y
x, y  = ecdf(n_defaults)

# Plot the ECDF with labeled axes
_ = plt.plot(x, y, marker = '.', linestyle = 'none')
_ = plt.xlabel('number of defaults out of 100')
_ = plt.ylabel('CDF')

plt.show()
# Compute the number of 100-loan simulations with 10 or more defaults: n_lose_money
n_lose_money = np.sum(n_defaults >= 10)

# Compute and print probability of losing money
print('Probability of losing money =', n_lose_money / len(n_defaults))


## 9.2.1 Poisson processes and the Poisson distribution 

**Poisson process --> *the timing of the next event is completely independent of when the previous event happened.***

*Examples:* natural births in a given hospital; hits on a website during a given hour.

--> The number of arrivals of a Poisson process in a given amount of time is Poisson distributed.

**Poisson distribution --> has one parameter, the average number of arrivals in a given length of time.**

*Example:* the number (r) of hits on a website in one hour, with an average hit rate of (6) hits per hour, is Poisson distributed.
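The website example lends itself to a hacker-statistics estimate. In this sketch, the rate of 6 hits per hour comes from the example, while the threshold of 10 hits and the number of simulated hours are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

# Simulate 100,000 hours of traffic at an average of 6 hits per hour.
hits_per_hour = rng.poisson(6, size=100_000)

# Estimate P(10 or more hits in an hour) as the fraction of simulated hours
# that reach the (assumed) threshold.
p_busy = np.mean(hits_per_hour >= 10)

# For a Poisson distribution, the mean and the variance are both equal to the rate.
print("mean:", hits_per_hour.mean(), "variance:", hits_per_hour.var())
print("estimated P(hits >= 10):", p_busy)
```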

*The Poisson distribution is a limit of the binomial distribution for a low probability of success and a large number of trials, that is, for rare events.*

**Note (1): for a fixed mean (n * p), the standard deviation of the binomial distribution gets closer and closer to that of the Poisson distribution as the probability (p) gets lower and lower.**

**Note (2): for rare events, that is, low (p) and high (n), the binomial distribution is well approximated by the Poisson distribution.**

samples = np.random.poisson(mean, size = 10000)

In [27]:
# Draw 10,000 samples out of Poisson distribution: samples_poisson
samples_poisson = np.random.poisson(10, size = 10000)

print('Poisson:     ', np.mean(samples_poisson),
                       np.std(samples_poisson))

# specify values of n and p to consider for binomial: n, p

n = [20,100, 1000]
p = [0.5, 0.1, 0.01]

# Draw 10,000 samples for each n,p pair: samples_binomial
for i in range(3):
    samples_binomial = np.random.binomial(n[i], p[i], size = 10000)
    
    print('n =', n[i], 'Binomial:', np.mean(samples_binomial),
                                    np.std(samples_binomial))

## 10. Continuous Variables 
**It's time to move on to continuous variables, which can take on any fractional value. Many of the principles are the same, but there are some subtleties.**

**Probability density function (PDF) --> the continuous analog of the PMF. It is a mathematical description of the relative likelihood of observing a value of a continuous variable.**

*Example:* Michelson's speed of light experiment, 100 measurements of the speed of light in air. Each measurement has some error in it; conditions such as temperature, humidity, and the alignment of his optics change from measurement to measurement.

*So the probability of observing a single exact value of the speed of light does not make sense, because there are infinitely many numbers between, say, 299.6 and 300.0 megameters per second. Instead,* **areas under the PDF give probabilities.**

*So, the probability of measuring that the speed of light is greater than 300,000 km/s is an area under the normal curve, about a 3% chance.*

To do this calculation, we are really just looking at the cumulative distribution function (CDF) of the Normal distribution.
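The "area under the curve" calculation can be sketched with nothing but the standard library, since the Normal CDF can be written in terms of the error function. The mean and standard deviation below are assumed, illustrative values in km/s; they are not computed from the dataset in this cell. With these assumptions the result comes out close to the 3% quoted above.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a Normal(mean, std) distribution, written via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# Assumed, illustrative parameters (km/s); not computed from the dataset here.
mean, std = 299_852.0, 79.0

# P(X > 300,000) is the area under the PDF to the right of 300,000,
# which is one minus the CDF evaluated at 300,000.
p_greater = 1.0 - normal_cdf(300_000.0, mean, std)
print(round(p_greater, 3))
```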

In [28]:
## sol == Speed Of Light 
sol = pd.read_csv("../input/inputdatasets/michelson_speed_of_light.csv")
sol.head()
### We are interested in the "velocity of light in air (km/s)" column 

In [29]:
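## PDF
# sns.distplot is deprecated in recent seaborn releases;
# histplot with a KDE overlay gives a similar plot.
_ = sns.histplot(sol['velocity of light in air (km/s)'], stat='density', kde=True)
_ = plt.xlabel('Speed of light (km/s)')
_ = plt.ylabel('PDF')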

In [30]:
## CDF
x, y = ecdf(sol['velocity of light in air (km/s)'])

_ = plt.plot(x, y, marker = '.', linestyle = 'none')
_ = plt.xlabel('Speed of light (km/s)')
_ = plt.ylabel('CDF')


## 10.1 Introduction to the Normal distribution 

**Normal distribution --> describes a continuous variable whose PDF has a single symmetric peak.**

***The normal distribution is parametrized by two parameters.***

The mean determines where the center of the peak is.

The standard deviation is a measure of how wide the peak is, or how spread out the data are.

**Comparing the histogram to the PDF suffers from binning bias, so it is better to compare the ECDF of the data to the theoretical CDF of the normal distribution.**

**The mean and standard deviation computed from the data are good estimates of the parameters, so we compute them and pass them into the function --> np.random.normal(mean, std, size = 10000)**

*To approximate the theoretical CDF, we draw a large number of samples with np.random.normal() and then compute their ECDF.*

Finally, we plot both the theoretical and empirical CDFs on the same plot.

**NOTE: *the mean and std here are the parameters of the normal distribution. Do not confuse them with the mean and std that we computed directly from the data when doing EDA; the values from the data are only estimates of those parameters.***

<img src="https://cdn-images-1.medium.com/max/2600/1*IdGgdrY_n_9_YfkaCh-dag.png" alt="Drawing" style="width: 700px;"/>

In [32]:
## Checking Normality of Michelson data

# the mean & std computed from the data are good estimates, so we'll compute them and pass them into the function.
mean = np.mean(sol['velocity of light in air (km/s)'])
std = np.std(sol['velocity of light in air (km/s)'])
# normally distributed theoretical samples 
samples = np.random.normal(mean, std, size = 10000)

x,y = ecdf(sol['velocity of light in air (km/s)'])

x_theor, y_theor = ecdf(samples)

sns.set()

_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle = 'none')

_ = plt.xlabel("Speed of light (km/s)")
_ = plt.ylabel("CDF")


Without binning bias, it is much clearer that the Michelson data are approximately Normally distributed.

In [40]:
# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20,1,size=100000)
samples_std3 = np.random.normal(20,3,size=100000)
samples_std10 = np.random.normal(20,10,size=100000)



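# Make normalized histograms (sns.distplot is deprecated in recent seaborn releases)
_ = plt.hist(samples_std1, bins=100, density=True, histtype='step')
_ = plt.hist(samples_std3, bins=100, density=True, histtype='step')
_ = plt.hist(samples_std10, bins=100, density=True, histtype='step')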


# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()
