### Descriptive Statistics 
#### Hugo Franco
#### Data Analytics, 2021-2S


### Some data load examples

The first step in data analysis is to collect the data and arrange it in a format that is convenient for further processing. The following cells show how to load files stored in several common formats.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb #future sessions

#### .csv Import


In [None]:
dataframe1=pd.read_csv("Yayacon.csv", delimiter='\t' )

In [None]:
dataframe1.head()
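
Writing a table back to disk uses the mirror-image call, `to_csv`. The following is a minimal sketch of the load/save round trip; the column names, the values and the in-memory buffer are made up for illustration, and a file path such as "Yayacon.csv" could be passed instead of the buffer.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk
raw = "name\tweight\nfox\t6.5\nowl\t1.2\n"

# Load: sep (alias: delimiter) tells pandas how the columns are separated
demo = pd.read_csv(io.StringIO(raw), sep="\t")

# Save: write the table back as tab-separated text, without the row index
buffer = io.StringIO()
demo.to_csv(buffer, sep="\t", index=False)
saved = buffer.getvalue()

print(demo)
```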

#### Formatted Text Import

In [None]:
import pandas as pd
df2=pd.read_csv("animals.txt", delimiter='|' )

In [None]:
df2

#### MS Excel import (.xlsx and .xls)

In [None]:
# Reading .xlsx files requires the 'openpyxl' package ('lxml' is used later by pd.read_html)

import pandas as pd
df3=pd.read_excel("93CAR.xlsx", sheet_name="Hoja1",engine="openpyxl")

In [None]:
df3.head()

#### URL data import


Pandas can read tabular data embedded in HTML documents, given the page URL and suitable matching parameters, e.g.

https://www.marketwatch.com/investing/stock/aapl/financials 
    

In [None]:
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
pd.read_html(url, match="Net Income")[0]

### Statistical data description

In [None]:
import pandas as pd
df4=pd.read_excel("93CAR.xlsx", sheet_name="Hoja1", engine="openpyxl")
print("\nFirst 5 records:")
df4.head()


In [None]:
print("\nVariable information:")
df4.info()

print("\nBasic statistical description per variable")
df4.describe()
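
By default, `describe()` summarizes only the numeric columns. The following sketch shows how the `include` parameter extends the summary to qualitative columns; it uses a small made-up table rather than 93CAR.xlsx.

```python
import pandas as pd

# Made-up records, for illustration only
cars = pd.DataFrame({
    "Type": ["Small", "Van", "Small", "Sporty"],
    "Horsepower": [90, 150, 110, 200],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = cars.describe()

# All columns: adds unique, top (most frequent value) and freq for text columns
full_summary = cars.describe(include="all")

print(full_summary)
```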


The low-level internal structure of the DataFrame object can be inspected as follows:

In [None]:
vars(df4)
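
`vars()` returns the internal attributes of the object, which are implementation details of pandas. As a sketch, the public accessors below give similar structural information in a more readable form (the table is made up for illustration).

```python
import pandas as pd

# Made-up records, for illustration only
demo = pd.DataFrame({"Type": ["Small", "Van"], "Max Price": [18.8, 22.7]})

print(demo.shape)          # (number of rows, number of columns)
print(list(demo.columns))  # variable labels
print(demo.dtypes)         # storage type of each variable
```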

Variables in the DataFrame can be accessed and explored individually by their label.

For example, two qualitative variables, __Vehicle Type__ and __Drive Train__:

In [None]:
VehicleType=df4["Type"]
VehicleType.head()

In [None]:
VehicleType.describe()


In [None]:
Train=df4["Drive Train"]
Train.head()

In [None]:
Train.describe()

 ...and two quantitative variables, __Max Price__ and __Power in HP__

In [None]:
MaxPrice=df4["Max Price"]
MaxPrice.head()

In [None]:

MaxPrice.describe()

In [None]:
HorsePw=df4["Horsepower"]
HorsePw.head()

In [None]:
HorsePw.describe()

### Statistical report

#### Qualitative variables


Frequency tables:

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
VehicleType.value_counts()

In [None]:
# Percent version
100 * VehicleType.value_counts() / len(VehicleType)
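
The same percentages can be obtained with the `normalize` parameter of `value_counts`. The sketch below uses a made-up series. The two forms agree when there are no missing values; `value_counts` drops them by default, whereas `len` counts every row.

```python
import pandas as pd

# Made-up observations, for illustration only
vehicle_type = pd.Series(["Small", "Van", "Small", "Sporty"], name="Type")

# Built-in relative frequencies, scaled to percent
relative = vehicle_type.value_counts(normalize=True) * 100

# Manual version: counts divided by the number of rows
manual = 100 * vehicle_type.value_counts() / len(vehicle_type)

print(relative)
```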

Absolute frequency bar plot

In [None]:
plot = VehicleType.value_counts().plot(kind='bar', title='Vehicle Type')

Relative frequency bar plot

In [None]:
plot = (100 * VehicleType.value_counts() / len(VehicleType)).plot(kind='bar', title='Vehicle Type (%)')

Pie chart for vehicle type:

In [None]:
plot = VehicleType.value_counts().plot(kind='pie', autopct='%.2f', figsize=(6, 6), title='Vehicle Type')

#### Quantitative variables


Histogram for maximum price

In [None]:
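import numpy as np

# Class edges: 7.9, 17.9, ..., 77.9 (class width 10)
bins = np.arange(7.9, 80, 10)

# Assign each price to its class; include_lowest=True keeps a value equal to the first edge
freq = pd.cut(MaxPrice, bins, include_lowest=True)

# Frequency table in class order (values outside the edges become NaN and are not counted)
table_freq = freq.value_counts().sort_index()
print(table_freq)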
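
A grouped frequency table usually reports relative and cumulative frequencies as well. The following is a minimal sketch with made-up prices and class edges; the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up prices and class edges, for illustration only
prices = pd.Series([7.9, 12.5, 18.0, 22.4, 35.0, 41.7])
edges = np.arange(0, 51, 10)  # 0, 10, ..., 50

# pd.cut builds classes that are open on the left and closed on the right
classes = pd.cut(prices, bins=edges)

# Absolute, relative and cumulative frequencies, listed in class order
table = classes.value_counts().sort_index().to_frame("absolute")
table["relative_%"] = 100 * table["absolute"] / table["absolute"].sum()
table["cumulative_%"] = table["relative_%"].cumsum()

print(table)
```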

In [None]:
import matplotlib.pyplot as plt

MaxPrice.hist(bins=10) 

plt.xlabel("Maximum Price")
plt.ylabel("Frequency")
plt.show()
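
With `bins=10`, the histogram splits the observed range into ten classes of equal width. The sketch below recovers the counts and class edges numerically with `np.histogram`, using made-up prices.

```python
import numpy as np

# Made-up prices, for illustration only
prices = np.array([7.9, 12.5, 18.0, 22.4, 35.0, 41.7, 80.0])

# Ten equal-width classes between the minimum and the maximum value
counts, edges = np.histogram(prices, bins=10)

print(counts)  # number of observations in each class
print(edges)   # 11 edges delimiting the 10 classes
```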

### Bivariate analysis

#### Qualitative variables
Contingency table __Vehicle type__ vs. __Drive train__

In [None]:
pd.crosstab(index=Train, columns=VehicleType, margins=True)

In [None]:
## Relative version (percents)
pd.crosstab(index=Train, columns=VehicleType, margins=True).apply(lambda r: r/len(VehicleType) *100, axis=1)
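
`pd.crosstab` can also compute percentages directly through its `normalize` parameter. The sketch below compares the three modes on made-up observations.

```python
import pandas as pd

# Made-up observations, for illustration only
vehicle_type = pd.Series(["Small", "Small", "Van", "Van"], name="Type")
drive_train = pd.Series(["Front", "Front", "Front", "4WD"], name="Drive Train")

# Share of the grand total in each cell
overall = pd.crosstab(drive_train, vehicle_type, normalize="all") * 100

# Row percentages: each drive train sums to 100
by_row = pd.crosstab(drive_train, vehicle_type, normalize="index") * 100

# Column percentages: each vehicle type sums to 100
by_col = pd.crosstab(drive_train, vehicle_type, normalize="columns") * 100

print(by_row)
```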

#### Comparative bars
__Vehicle type__ vs. __Drive train__

In [None]:
# Row percentages per vehicle type; margins are left out so the "All" totals are neither double-counted nor plotted
plot = (pd.crosstab(index=VehicleType, columns=Train, normalize='index') * 100).plot(kind='bar')

#### Stacked bars
__Vehicle type__ vs. __Drive train__

In [None]:
# Stacked bar chart of Vehicle Type by Drive Train (column percentages, without the "All" margins)

plot = (pd.crosstab(index=VehicleType, columns=Train, normalize='columns') * 100).plot(kind='bar', stacked=True)

### Quantitative variable analysis

#### Box plots for grouped data
__Power in HP__ vs __Drive train__

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

# One box per drive train; passing the Series directly keeps their names as axis labels
boxes = sns.boxplot(x=Train, y=HorsePw, hue=Train)
boxes.set_title("Distribution of Power according to Drive Train")
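
The quantities a box plot draws (quartiles and interquartile range) can also be tabulated per group. The following is a minimal sketch with made-up values; the column names mirror those used above.

```python
import pandas as pd

# Made-up records, for illustration only
cars = pd.DataFrame({
    "Drive Train": ["Front", "Front", "Front", "Rear", "Rear", "Rear"],
    "Horsepower": [90, 110, 130, 150, 200, 250],
})

# Quartiles of horsepower within each drive train (one row per group)
by_train = cars.groupby("Drive Train")["Horsepower"]
quartiles = by_train.quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range: the height of each box
iqr = quartiles[0.75] - quartiles[0.25]

print(quartiles)
print(iqr)
```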