## Statistical Methods Using Python for Analyzing Stocks

Abstract: Statistical methods are part of the toolkit for analyzing securities. This chapter explains the central limit theorem, returns, ranges, boxplots, histograms, and other statistical measures for analyzing securities, using data from the Yahoo Finance API.

Keywords: Central limit theorem · Returns · Plots · Statistical measures

This next part of the book centers on the use of mathematical and statistical methods to understand a security through quantitative analysis. The aim of quantitative analysis is to extract a value that explains financial behavior (Keaton 2019).

### The Central Limit Theorem

The Central Limit Theorem (CLT) is a result from probability theory. It states that if random samples of a certain size (n) are drawn repeatedly from any population, the distribution of the sample means will approach a normal distribution as n grows. A normal distribution has no left or right bias in the data (Ganti 2019).

The usual representation of a normal distribution is the bell curve, which looks like a bell, hence the name. The CLT requires a sufficiently large sample size drawn from a population whose variance is finite. In a normal distribution, the mean is equal to the median and equal to the mode. This means that the data are completely symmetric: 50% of the values are higher than the mean and 50% are lower than the mean.
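The theorem can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the sample sizes, the seed, and the helper names `sample_means` and `skewness` are assumptions chosen for the example and are not part of the IBM analysis. It draws repeated samples from a strongly skewed population and measures how symmetric the distribution of the sample means becomes as n grows.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical right-skewed population (exponential), not market data
population = rng.exponential(scale=1.0, size=100_000)

def sample_means(data, n, draws, rng):
    """Draw `draws` random samples of size n and return the mean of each."""
    samples = rng.choice(data, size=(draws, n), replace=True)
    return samples.mean(axis=1)

def skewness(x):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

means_small = sample_means(population, n=2, draws=5_000, rng=rng)
means_large = sample_means(population, n=200, draws=5_000, rng=rng)

print("population skewness:       ", round(skewness(population), 2))
print("skewness of means (n=2):   ", round(skewness(means_small), 2))
print("skewness of means (n=200): ", round(skewness(means_large), 2))
```

The skewness of the sample means should shrink toward zero as n increases, which is the symmetry the bell curve describes. Plotting `means_large` with `plt.hist` would show the same effect visually.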

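The mean is usually divided into two: (1) the population mean and (2) the sample mean. The distinction matters when deciding how to treat the data that is going to be used. The formulas for the two are:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $N$ is the number of elements in the population and $n$ is the number of elements in the sample.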


In [17]:
import yfinance as yf
import numpy as np
import pandas as pd
import pandas_datareader
import datetime
import pandas_datareader.data as web
import matplotlib.pyplot as plt
%matplotlib inline

start = datetime.datetime(2015,1,1)
end = datetime.datetime(2019,1,1)
IBM = yf.download('IBM', start, end)

[*********************100%%**********************]  1 of 1 completed


Because the IBM data cover only January 1, 2015 to January 1, 2019, they are treated as a sample. Therefore, the sample mean x̄ is taken as the mean of the security. The mean is the sum of the elements divided by the count of the elements.
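The definition can be verified by hand. The following is a minimal sketch that assumes a short, made-up price series (`close`) in place of `IBM['Close']`, so that it runs without downloading data.

```python
import pandas as pd

# Hypothetical closing prices standing in for IBM['Close']
close = pd.Series([140.0, 142.5, 145.0, 147.5, 150.0], name="Close")

# Sum of the elements divided by the count of the elements
manual_mean = close.sum() / close.count()

print(manual_mean)   # computed by hand
print(close.mean())  # computed by pandas
```

Both lines should print the same value. Note that `sum()`, `count()`, and `mean()` all skip missing values by default, so the two approaches also agree on a series that contains gaps.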


In [18]:
print(IBM['Close'].mean())

145.16948027260023


The second measure is the median. The median is the middle value of the set of numbers once they are sorted. In Python it is calculated as follows:

In [19]:
print(IBM['Close'].median())

145.79827880859375
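To make "middle value" concrete, here is a small sketch with a hypothetical helper, `manual_median`, applied to made-up prices. With an odd number of values the median is the central element of the sorted data; with an even number it is the average of the two central elements.

```python
import pandas as pd

def manual_median(values):
    """Return the middle value of the sorted data (illustrative helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single central element
        return ordered[mid]
    # Even count: average the two central elements
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_prices = [150.0, 140.0, 145.0]           # made-up values
even_prices = [150.0, 140.0, 145.0, 147.0]   # made-up values

print(manual_median(odd_prices), pd.Series(odd_prices).median())
print(manual_median(even_prices), pd.Series(even_prices).median())
```

The hand-written helper and `Series.median()` should agree in both cases.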


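Which one to choose? The rule of thumb is to use the mean when there are no outliers and to use the median when there are outliers, because a single extreme value can pull the mean far from the bulk of the data while barely moving the median.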
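The effect of an outlier on each measure can be seen with a few made-up prices. In the sketch below, the series `prices` and the extreme value 500.0 are assumptions chosen only to illustrate the rule of thumb.

```python
import pandas as pd

# Hypothetical prices clustered around 145
prices = pd.Series([144.0, 145.0, 146.0, 145.0, 145.0])

# The same prices plus one extreme observation
with_outlier = pd.concat([prices, pd.Series([500.0])], ignore_index=True)

print("mean without outlier:  ", prices.mean())
print("median without outlier:", prices.median())
print("mean with outlier:     ", with_outlier.mean())
print("median with outlier:   ", with_outlier.median())
```

The single outlier moves the mean well above every ordinary price, while the median stays where most of the data lies.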

The third measure is the mode. The mode is the most frequent value in the data set. In a histogram, it is the highest bar.

To calculate it in Python:

In [20]:
print(IBM['Close'].mode())

IBM['Close'].describe()

0    140.038239
Name: Close, dtype: float64


count    1006.000000
mean      145.169480
std        12.793130
min       102.839386
25%       138.499046
50%       145.798279
75%       153.319794
max       173.948380
Name: Close, dtype: float64
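A note on the mode output above: `Series.mode()` returns a Series rather than a single number, because several values can tie for most frequent. For prices stored as floating-point numbers, exact repeats can be rare, so the exact mode may reflect only a handful of coincidences. One possible approach, shown here as a sketch on made-up values, is to round the prices before taking the mode; the choice of rounding to whole units is an assumption for illustration, not a rule.

```python
import pandas as pd

# Hypothetical closing prices; only 140.04 repeats exactly
close = pd.Series([140.04, 140.04, 145.80, 145.79, 145.81, 153.32])

# Mode of the raw values: driven by the one exact repeat
exact_mode = close.mode()

# Mode after rounding to whole units: reflects where prices cluster
rounded_mode = close.round(0).mode()

print(exact_mode.tolist())
print(rounded_mode.tolist())
```

The rounded mode corresponds more closely to the highest bar of a histogram, since a histogram also groups nearby values into bins.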