# Statistical Data Analysis

Statistical data analysis falls into two broad categories:
1. Descriptive Statistical Analysis
This type of analysis summarizes the data in a sample. It describes the variables in the sample, and the relationships between them, through summary measures.
E.g.: mean, median, mode, standard deviation, and variance.
2. Inferential Statistical Analysis
This type of analysis draws conclusions about a whole population from a random sample of it, typically by testing a null hypothesis against an alternative hypothesis while accounting for random variation.
E.g.: probability distributions, correlation testing, and regression analysis.

In this chapter, let us consider a few data sets to perform statistical data analysis and draw inferences.

# Descriptive Statistics

Data set - Retail Sales Data from Kaggle. <br> 
Ref: https://www.kaggle.com/ashkash247/retail-sales-data

In [4]:
import pandas as pd
df = pd.read_csv('./Retail_Sales_Data.csv')

In [5]:
df.head()

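|   | Transaction_ID | Customer_ID | State | Age | Shop_Category | Sales | Gender | Items_in_basket |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1234 | MP | 10 | Grocery | 10 | M | 2 |
| 1 | 2 | 1235 | UP | 21 | Dairy | 30 | F | 3 |
| 2 | 3 | 1236 | AP | 23 | Deli | 23 | F | 4 |
| 3 | 4 | 1237 | RP | 25 | Meat | 21 | F | 4 |
| 4 | 5 | 1238 | DP | 27 | Clothes | 90 | F | 3 |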


We can observe that there are 8 variables/columns in the data.

In [6]:
df.shape

(312, 8)

The data contains 312 records and 8 variables; that is, the dimensionality of the data is 312 x 8.

## Measures of Frequency

Measures of frequency help to determine the count, frequency, or percentage of occurrences of each value of a variable in the data.

    1. Count
    2. Percent
    3. Frequency
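
As a minimal sketch of how these three measures relate, the snippet below builds a small frequency table. The `toy` series is hypothetical illustration data, not the Kaggle file; `value_counts(normalize=True)` returns relative frequencies, which become percentages when multiplied by 100.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file: six transactions by gender.
toy = pd.Series(["F", "M", "F", "F", "M", "F"], name="Gender")

counts = toy.value_counts()                         # absolute count per category
percents = toy.value_counts(normalize=True) * 100   # share of each category in percent

# Combine both views into one frequency table, aligned on the category labels.
freq_table = pd.DataFrame({"count": counts, "percent": percents})
print(freq_table)
```

The same pattern should apply to any categorical column of `df`, such as `Gender` or `State`.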
    

### Q1. How many transactions were made by female and by male customers?

In [8]:
genderDist = df['Gender'].value_counts()
print(genderDist)

F    209
M    103
Name: Gender, dtype: int64


<font color="red">Ans(1). Of the 312 transactions, 209 were made by female customers and 103 by male customers.</font>

### Q2. What percentage of transactions were made by male and by female customers?

In [10]:
# Look up counts by label: integer positions depend on the sort order of value_counts()
total = genderDist.sum()
malePercent = genderDist["M"] * 100 / total
femalePercent = genderDist["F"] * 100 / total
print("Male %: " + str(malePercent))
print("Female %: " + str(femalePercent))

Male %: 33.01282051282051
Female %: 66.98717948717949


### Q3. How many transactions were made from each state? Which state has the highest number of transactions?

In [9]:
stateTrans = df["State"].value_counts()
print(stateTrans)

DP    97
AP    56
KP    52
RP    51
UP    28
MP    28
Name: State, dtype: int64


<font color="red">Ans(3). State DP has the highest number of transactions (97).</font>
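
Rather than reading the top row off the printed counts, the most frequent category can also be extracted programmatically. The sketch below uses a hypothetical `states` series in place of `df["State"]`.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file.
states = pd.Series(["DP", "AP", "DP", "KP", "DP", "AP"], name="State")

state_counts = states.value_counts()   # sorted from most to least frequent
top_state = state_counts.idxmax()      # label with the largest count
top_count = state_counts.max()         # the largest count itself
print(top_state, top_count)
```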

## Measures of Central Tendency
Measures of central tendency summarize an entire set of values with a single representative number, such as:
    1. Mean
    2. Median
    3. Mode
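
The following sketch contrasts the three measures on a small hypothetical sample (the values are made up for illustration). It also shows why `mode()` is indexed with `[0]` in this chapter: pandas returns a Series, because several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical basket sizes; the last value is a deliberate outlier.
items = pd.Series([2, 3, 3, 4, 4, 4, 5, 23])

mean_items = items.mean()      # arithmetic average, pulled up by the outlier
median_items = items.median()  # middle value of the sorted data
modes = items.mode()           # Series holding the most frequent value(s)

print(mean_items, median_items, modes.tolist())
```

In this sample the single large value raises the mean well above the median, while the median and mode stay at the typical basket size.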

### Q4. What is the average number of items purchased per transaction in the given data?

In [31]:
print(df["Items_in_basket"].mean())

4.336538461538462


<font color="red">Ans(4). The average number of items in a basket is about 4 (4.34).</font>

### Q5. Which customer age appears most often in the transactions?

In [34]:
print(df["Age"].mode()[0])

30
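
The mode above is a single age. To look at age *groups* instead, one option is to bin the ages first and then take the most frequent bin. This is only a sketch: the sample ages, the bin edges, and the group labels below are assumptions chosen for illustration, not part of the data set.

```python
import pandas as pd

# Hypothetical ages, not the Kaggle data.
ages = pd.Series([10, 21, 23, 25, 27, 30, 30, 41, 52, 30])

# Assumed bin edges; pd.cut treats each bin as right-inclusive by default,
# so these correspond to (0, 18], (18, 30], (30, 45], (45, 120].
bins = [0, 18, 30, 45, 120]
labels = ["0-18", "19-30", "31-45", "46+"]

age_group = pd.cut(ages, bins=bins, labels=labels)
group_counts = age_group.value_counts()       # transactions per age group
most_common_group = group_counts.idxmax()     # group with the most transactions
print(group_counts)
print(most_common_group)
```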


### Q6. What is the median number of items purchased in a transaction?

In [35]:
print(df["Items_in_basket"].median())

4.0


<font color="red">Ans(6). The median number of items in a basket is 4. From Q4 we can observe that the mean is also around 4 (4.34). A mean close to the median suggests that the bulk of the distribution is roughly symmetric around its center; the mean being slightly higher than the median hints at a mild right skew caused by a few large baskets.</font>
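
Comparing mean and median is an informal check. A slightly more direct one is the sample skewness, which pandas exposes as `Series.skew()`. The sketch below uses two hypothetical samples: one symmetric and one with a long right tail.

```python
import pandas as pd

# Hypothetical samples for illustration only.
symmetric = pd.Series([2, 3, 4, 5, 6])
right_skewed = pd.Series([2, 3, 3, 4, 4, 4, 5, 23])

# For symmetric data the mean equals the median and the skewness is zero.
print(symmetric.mean(), symmetric.median(), symmetric.skew())

# A long right tail pulls the mean above the median and makes the skewness positive.
print(right_skewed.mean(), right_skewed.median(), right_skewed.skew())
```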

## Measures of Variability
Measures of variability indicate the degree to which values differ from the average. These measures help to analyse how spread out the data is.
    1. Range
    2. Variance
    3. Standard Deviation

### Q7. What is the range of items purchased in a transaction?

In [39]:
print("Range of items in basket\t" + str(min(df["Items_in_basket"])) +" - " + str(max(df["Items_in_basket"])))

Range of items in basket	2 - 23
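
The range is often reported as a single number, the difference between the largest and smallest value. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical basket sizes, not the Kaggle data.
items = pd.Series([2, 3, 4, 4, 5, 23])

value_range = items.max() - items.min()   # largest value minus smallest value
print(value_range)
```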


### Q8. What are the variance and standard deviation of the number of items purchased in a transaction?

In [42]:
print(df["Items_in_basket"].var())
print(df["Items_in_basket"].std())

5.490879297551306
2.343262532784431


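<font color="red">Ans(8). The variance is 5.49; it measures how far the values are spread out from the mean, in squared units. The standard deviation is the square root of the variance and expresses that spread in the original units (items); for our data it is 2.34. Note that pandas computes the sample versions of both by default (ddof=1).</font>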
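
The sketch below illustrates, on a small hypothetical sample, the difference between the sample variance (pandas default, dividing by n - 1) and the population variance (dividing by n), and confirms that the standard deviation is the square root of the variance.

```python
import math
import pandas as pd

# Hypothetical basket sizes, not the Kaggle data. Their mean is 5.
items = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = items.var()             # pandas default ddof=1: divides by n - 1
population_var = items.var(ddof=0)   # divides by n
sample_std = items.std()             # square root of the sample variance

print(sample_var, population_var, sample_std)
print(math.isclose(sample_std, math.sqrt(sample_var)))
```

For a sample of 312 records the two variance definitions differ only slightly, but the distinction matters for small samples.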

## Measures of Position
A measure of position determines the position of a single value in relation to other values in a sample or a population data set. Unlike the mean and the standard deviation, descriptive measures based on quantiles are not sensitive to the influence of a few extreme observations. These measures can be:
    
    1. Percentiles
    2. Quartiles
    
The pth percentile of a data set is a value such that, after the data are ordered from smallest to largest, at least p% of the data are at or below this value and at least (100 - p)% are at or above it.

The median is the value at or below which fifty percent of the data values fall. Therefore, the median is the 50th percentile.

There are two other important percentiles: the 25th percentile, typically denoted Q1, and the 75th percentile, typically denoted Q3. Q1 is commonly called the lower quartile and Q3 the upper quartile.
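
To illustrate why quantile-based measures are described as insensitive to extreme observations, the sketch below computes the quartiles and the interquartile range (IQR = Q3 - Q1) for two hypothetical samples that differ only in their largest value. The helper name `quartile_summary` is made up for this example.

```python
import pandas as pd

# Hypothetical data: the second sample replaces the largest value with an extreme one.
clean = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 7])
with_outlier = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 70])

def quartile_summary(values):
    """Return the quartiles, the IQR, and the mean of a numeric Series."""
    q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])
    return {"Q1": q1, "median": q2, "Q3": q3, "IQR": q3 - q1, "mean": values.mean()}

summary_clean = quartile_summary(clean)
summary_outlier = quartile_summary(with_outlier)
print(summary_clean)
print(summary_outlier)
```

The quartiles and the IQR are identical for both samples, while the mean changes substantially once the extreme value is present.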
    

### Q9. What are the 25th (quartile Q1), 50th (quartile Q2), and 75th (quartile Q3) percentiles of the items in the basket data?

In [45]:
print(df["Items_in_basket"].quantile(q=0.25))
print(df["Items_in_basket"].quantile(q=0.50))
print(df["Items_in_basket"].quantile(q=0.75))

3.0
4.0
5.0


<font color="red">Ans(9). We can observe that the 25th percentile is 3, which means at least 25% of the data is at or below 3. Likewise, at least 50% of the data is at or below 4 and at least 75% is at or below 5.
Also, we can clearly see that the 50th percentile is equal to the median from Q6.</font>
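
Most of the measures covered in this section can also be obtained in one call with `describe()`, which reports the count, mean, standard deviation, minimum, the three quartiles, and the maximum. A minimal sketch on hypothetical data (the values are not from the Kaggle file):

```python
import pandas as pd

# Hypothetical basket sizes for illustration.
items = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 7], name="Items_in_basket")

summary = items.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```

Calling `df["Items_in_basket"].describe()` on the retail data should give the same layout, with the `50%` row matching the median.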