<a href="https://colab.research.google.com/github/MK316/Spring2024/blob/main/Seminar/Seminar01B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌱 **2. Descriptive statistics**

1. What does 'descriptive statistics' refer to?
2. **Key measures**
3. **Visualize data**
4. **Real data for practice**
5. Applications and limitations


## **2.2 Key measures**

+ Mean
+ Median
+ Mode
+ Range
+ Variance
+ Standard deviation
+ Interquartile range (IQR)

* 🔎 Python packages to use: {numpy}, {scipy}

In [None]:
import numpy as np
from scipy import stats

In [None]:
#@markdown Generate a sample data set: **data** = 1~100

data = list(range(1,101))
str(data)

* Mean

In [None]:
mean = np.mean(data)
print(mean)

* Median

In [None]:
median = np.median(data)
print(median)

* Mode

In [None]:
mode = stats.mode(data)
print(mode)
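
Because every value in `data` occurs exactly once, there is no meaningful mode in the 1~100 sample, and `stats.mode` still returns just one value. The sketch below uses two hypothetical lists (`scores` and `tied`, not part of the notebook data) to show a clear mode and a tie. `statistics.multimode` from the standard library is swapped in for the tie case because it returns every tied value.

```python
import numpy as np
from statistics import multimode
from scipy import stats

# Hypothetical sample with repeated values (illustration only)
scores = [70, 85, 85, 90, 90, 90, 95]

result = stats.mode(scores)
# np.ravel handles both scalar and array results across SciPy versions
single_mode = int(np.ravel(result.mode)[0])
print("Mode:", single_mode)

# Hypothetical sample in which two values tie for most frequent
tied = [1, 1, 2, 2, 3]
all_modes = multimode(tied)
print("All modes:", all_modes)
```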

* Range

In [None]:
data_range = np.ptp(data)  # Peak-to-peak (max - min)
print(data_range)

* Variance

In [None]:
data_var = np.var(data)
data_var

* Standard deviation

In [None]:
data_std = np.std(data)
print("STD: ", data_std)
print("Variance: ", data_var)

* Interquartile Range (IQR): 75th percentile - 25th percentile

In [None]:
iqr = np.percentile(data, 75) - np.percentile(data, 25)
print(iqr)
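
As a cross-check, SciPy also provides `stats.iqr`. The sketch below rebuilds the same 1~100 sequence under the assumed name `values` and compares the manual calculation with the SciPy result.

```python
import numpy as np
from scipy import stats

values = list(range(1, 101))  # same 1~100 sequence as the sample data

# Manual IQR from the 25th and 75th percentiles
q1, q3 = np.percentile(values, [25, 75])
iqr_manual = q3 - q1

# SciPy's built-in IQR function
iqr_scipy = stats.iqr(values)

print("Q1:", q1, "Q3:", q3)
print("Manual IQR:", iqr_manual)
print("SciPy IQR: ", iqr_scipy)
```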


**Using df.describe()**

**Note:** To use df.describe(), df must be a pandas DataFrame.

In [None]:
import pandas as pd

df = pd.DataFrame({'Column_Name': data})  # Convert the list to a DataFrame
df.describe()
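
One detail worth checking: the `std` row reported by `describe()` is the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population formula (`ddof=0`), so the two numbers differ slightly. A minimal sketch, using the assumed names `values` and `df_check` so that nothing in the notebook is overwritten:

```python
import numpy as np
import pandas as pd

values = list(range(1, 101))  # same 1~100 sequence as the sample data
df_check = pd.DataFrame({'Column_Name': values})

pop_std = np.std(values)             # population SD: divides by n
sample_std = np.std(values, ddof=1)  # sample SD: divides by n - 1
describe_std = df_check['Column_Name'].describe()['std']

print("np.std (ddof=0):", pop_std)
print("np.std (ddof=1):", sample_std)
print("describe() std: ", describe_std)
```

The same `ddof` argument applies to `np.var`.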

## **2.3 Visualize data**

1. Histogram
  * Density graph
  * Normality test

2. Box plot (or Box-and-Whisker plot)
3. Scatter plot
4. Bar plot
5. Pie chart
6. Line graph
7. Radar chart
8. Heatmap
9. Violin plot
10. Pareto plot
11. Stem-and-leaf plot
12. Time series plot
13. Frequency polygon
14. Dot plot

To save a plot:
> **plt.savefig('filename.png', dpi=300)**

> Option: bbox_inches='tight' (trims the extra margins around the figure)
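
A minimal sketch of saving a figure with both options. The data points and the file name `example_plot.png` are placeholders chosen for illustration. In a script, call `savefig` before `plt.show()`, since the figure may be cleared once it has been shown.

```python
from pathlib import Path
import matplotlib.pyplot as plt

# Placeholder data for illustration
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8], marker='o')
ax.set_title('Example plot to save')

# Hypothetical output file name
out_path = Path('example_plot.png')

# dpi controls the resolution; bbox_inches='tight' trims the surrounding margins
fig.savefig(out_path, dpi=300, bbox_inches='tight')
plt.close(fig)  # Release the figure once it is written to disk

print("Saved:", out_path.exists())
```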

### [1] Histogram

* Data distribution: {matplotlib}

In [None]:
import matplotlib.pyplot as plt

In [None]:
import numpy as np

# Generate 100 observations from a normal distribution with mean 68 and standard deviation 12
mean = 68
std_dev = 12
data = np.random.normal(mean, std_dev, 100)

print(data)

* Histogram

In [None]:
plt.hist(data, bins=10, edgecolor='k')
# Save the plot to a file (e.g., 'histogram.png')
plt.savefig('histogram.png', dpi=300)  # You can change the filename and format


* **Density line** to the histogram

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

plt.hist(data, bins=10, edgecolor='k', density=True, color = "orange")  # Set density=True
plt.xlabel('Values')
plt.ylabel('Density')
plt.title('Histogram with Density Line')

# Calculate parameters for the density line (mean and standard deviation)
mean = np.mean(data)
std_dev = np.std(data)

# Generate data points for the density line
x = np.linspace(min(data), max(data), 100)
density = norm.pdf(x, mean, std_dev)

# Plot the density line
plt.plot(x, density, 'b-', label='Density Line')  # 'b-' = solid blue line

# Display legend
plt.legend()

# Show the plot
plt.show()

* **Normality test:** Shapiro-Wilk normality test using {scipy}

In [None]:
from scipy import stats

# Perform the Shapiro-Wilk normality test
statistic, p_value = stats.shapiro(data)

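# Set the significance level (alpha)
alpha = 0.05

# Print the test result
print("Shapiro-Wilk Test Result:")
print(f"Statistic: {statistic}")
print(f"P-value: {p_value}")

# Interpret the result against alpha
if p_value > alpha:
    print("Fail to reject H0: the data are consistent with a normal distribution.")
else:
    print("Reject H0: the data do not appear to be normally distributed.")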
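
For contrast, the sketch below runs the same test on clearly skewed data. The exponential sample `skewed` and the seed are assumptions made for illustration. With data this skewed, the p-value should fall well below 0.05.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample (exponential), seeded for reproducibility
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=200)

stat_skewed, p_skewed = stats.shapiro(skewed)

print(f"Statistic: {stat_skewed}")
print(f"P-value: {p_skewed}")
print("Normality rejected at the 0.05 level:", p_skewed < 0.05)
```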

### [2] Box plot

In [None]:
import matplotlib.pyplot as plt

# Create a box plot
plt.boxplot(data)
plt.xlabel('Data')
plt.ylabel('Values')
plt.title('Box Plot of Data')
plt.show()
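
The numbers behind a box plot can also be computed directly. By default, `plt.boxplot` draws whiskers out to the most extreme points within 1.5 × IQR of the quartiles and marks anything beyond as an outlier. A minimal sketch with a hypothetical `sample` array that contains one extreme value:

```python
import numpy as np

# Hypothetical sample with one extreme value (illustration only)
sample = np.array([52, 55, 58, 60, 61, 63, 65, 68, 70, 120])

q1, med, q3 = np.percentile(sample, [25, 50, 75])
iqr_sample = q3 - q1

# Fences used by the default 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr_sample
upper_fence = q3 + 1.5 * iqr_sample

# Points outside the fences are drawn as individual outlier markers
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print("Q1, median, Q3:", q1, med, q3)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```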

### [3] Scatter plot: this requires two data sets

In [None]:
import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [10, 12, 8, 15, 9]

# Create a scatter plot
plt.scatter(x, y)
#plt.scatter(x,y, marker='o', color='red', s=100, label='Data Points')  # Change marker, color, and size)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.grid(True)  # Add a grid (optional)

# Set x-axis and y-axis limits (xlim and ylim)
plt.xlim(0, 6)  # Set x-axis limits from 0 to 6
plt.ylim(5, 20)  # Set y-axis limits from 5 to 20

# Show the plot
plt.show()
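
A scatter plot is often paired with a number that summarizes the relationship. The sketch below computes the Pearson correlation coefficient for the same example values with `np.corrcoef`. The names `x_vals` and `y_vals` are assumptions used to avoid overwriting `x` and `y`.

```python
import numpy as np

# Same example values as in the scatter plot
x_vals = [1, 2, 3, 4, 5]
y_vals = [10, 12, 8, 15, 9]

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(x_vals, y_vals)[0, 1]
print(f"Pearson r: {r:.3f}")
```

A value this close to 0 matches the plot, where the points show no clear linear trend.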


### [4] Bar plot

In [None]:
import matplotlib.pyplot as plt

# Example data
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [10, 15, 8, 12]

# Create a bar plot
plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot')
#plt.grid(axis='y')  # Add horizontal grid lines (use axis='x' for vertical lines, or 'both')

# Show the plot
plt.show()


### [5] Pie chart

In [None]:
import matplotlib.pyplot as plt

# Example data
labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [30, 20, 15, 35]
colors = ['blue', 'green', 'red', 'orange']
explode = (0.1, 0, 0, 0)  # Explode the first slice (Category A)
#explode = (0, 0.1, 0, 0)  # Explode the second slice (Category B)

# Create a pie chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Pie Chart')

# Show the plot
plt.show()


### [6] Line graph

In [None]:
import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [10, 12, 8, 15, 9]

# Create a line graph
plt.plot(x, y, marker='o', linestyle='-', color='blue', label='Line Plot')  # Customize markers, linestyle, and color
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Graph')
plt.grid(True)  # Add a grid (optional)

# Show the plot
plt.legend()  # Display the legend (optional)
plt.show()


### [7] Radar chart

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Example data for different categories
categories = ['Category A', 'Category B', 'Category C', 'Category D', 'Category E']

# Data values for a single data point (e.g., an individual or group)
data_values = [4, 3, 5, 2, 4]

# Number of categories
num_categories = len(categories)

# Calculate the angle for each category
angles = np.linspace(0, 2 * np.pi, num_categories, endpoint=False).tolist()
angles += angles[:1]  # Close the circle

# Create a radar chart
plt.figure(figsize=(6, 6))
ax = plt.subplot(111, polar=True)  # Specify polar projection

# Repeat the first data point to create a closed loop
data_values += data_values[:1]

# Plot the data
ax.fill(angles, data_values, 'b', alpha=0.1)  # Fill the area under the curve
ax.set_thetagrids(np.degrees(angles[:-1]), labels=categories)  # Set category labels

# Add a title
plt.title('Radar Chart')

# Show the plot
plt.show()


### [8] Heatmap: using {seaborn}

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Imaginary data for word frequency in books
data = pd.DataFrame({
    'Book 1': [10, 20, 30, 40, 50],
    'Book 2': [15, 25, 35, 45, 55],
    'Book 3': [5, 10, 15, 20, 25],
    'Book 4': [25, 15, 30, 10, 5],
    'Book 5': [30, 40, 20, 10, 15]
})

# Create a heatmap using seaborn
plt.figure(figsize=(10, 6))  # Set the figure size
sns.heatmap(data, annot=True, fmt='d', cmap='coolwarm', cbar=True)

# Customize labels and title
plt.xlabel('Books')
plt.ylabel('Word (row index)')
plt.title('Word Frequency in Books')

# Show the heatmap
plt.show()


### [9] Violin plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Example data
data = sns.load_dataset("tips")

# Create a violin plot
plt.figure(figsize=(8, 6))  # Set the figure size
sns.violinplot(x="day", y="total_bill", data=data, inner="quart")

# Customize labels and title
plt.xlabel('Day of the Week')
plt.ylabel('Total Bill Amount')
plt.title('Violin Plot Example')

# Show the violin plot
plt.show()


### [10] Pareto plot

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Example data
data = {
    'Category': ['Category A', 'Category B', 'Category C', 'Category D', 'Category E'],
    'Count': [40, 30, 20, 10, 5]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Sort the DataFrame by 'Count' in descending order (largest category first)
df = df.sort_values(by='Count', ascending=False)

# Calculate cumulative percentage
df['Cumulative Percentage'] = (df['Count'].cumsum() / df['Count'].sum()) * 100

# Specify bar colors
bar_colors = ['blue', 'green', 'red', 'purple', 'orange']

# Create a Pareto plot
plt.figure(figsize=(10, 6))
ax = plt.subplot()

# Bar plot for 'Count' with custom colors
ax.bar(df['Category'], df['Count'], color=bar_colors, alpha=0.7, label='Count')

# Line plot for 'Cumulative Percentage'
ax.plot(df['Category'], df['Cumulative Percentage'], color='gray', marker='o', label='Cumulative Percentage')

# Customize labels and title
plt.xlabel('Categories')
plt.ylabel('Count / Cumulative Percentage')
plt.title('Pareto Plot (Descending Order)')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Show the Pareto plot with legend
plt.legend()
plt.show()
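
A Pareto chart conventionally orders the bars from largest to smallest and reads the cumulative percentage from its own axis. The sketch below is one possible variant that uses `twinx()` to add a secondary y-axis. The names `pareto_df`, `ax_count`, and `ax_pct` are assumptions introduced for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same example counts, sorted from largest to smallest
pareto_df = pd.DataFrame({
    'Category': ['Category A', 'Category B', 'Category C', 'Category D', 'Category E'],
    'Count': [40, 30, 20, 10, 5],
}).sort_values(by='Count', ascending=False)

pareto_df['Cumulative Percentage'] = (
    pareto_df['Count'].cumsum() / pareto_df['Count'].sum() * 100
)

fig, ax_count = plt.subplots(figsize=(10, 6))

# Bars use the left axis (counts)
ax_count.bar(pareto_df['Category'], pareto_df['Count'], alpha=0.7)
ax_count.set_xlabel('Categories')
ax_count.set_ylabel('Count')

# Cumulative line uses a second y-axis on the right (percent)
ax_pct = ax_count.twinx()
ax_pct.plot(pareto_df['Category'], pareto_df['Cumulative Percentage'],
            color='gray', marker='o')
ax_pct.set_ylabel('Cumulative Percentage (%)')
ax_pct.set_ylim(0, 105)

ax_count.set_title('Pareto Plot with a Secondary Axis')
plt.show()
```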


### [11] Stem-and-leaf plot

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Example data (a list of numerical values)
data = [12, 27, 35, 46, 53, 67, 71, 83, 94]

# Extract the leading digits (stems) and trailing digits (leaves)
stems = [int(str(x)[:-1]) for x in data]
leaves = [int(str(x)[-1]) for x in data]

# Create a stem-and-leaf plot with orange color
plt.figure(figsize=(8, 6))  # Set the figure size
# Note: the use_line_collection argument was removed in recent matplotlib versions
plt.stem(stems, leaves, markerfmt="o", basefmt=" ", linefmt="orange", bottom=-1, label='Data')

# Customize labels and title
plt.xlabel('Stems')
plt.ylabel('Leaves')
plt.title('Stem-and-Leaf Plot')

# Show the plot
plt.legend()
plt.show()
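
`plt.stem` draws vertical stems with markers, which is not quite the classic stem-and-leaf display, where each row lists a stem followed by its leaves as text. A minimal text-based sketch in plain Python, using a hypothetical list `values` that extends the example data so that some stems carry several leaves:

```python
from collections import defaultdict

# Hypothetical two-digit values (illustration only)
values = [12, 27, 35, 46, 53, 67, 71, 83, 94, 15, 38, 31, 64]

# Group the last digit (leaf) under the leading digit (stem)
stem_leaf = defaultdict(list)
for value in sorted(values):
    stem, leaf = divmod(value, 10)
    stem_leaf[stem].append(leaf)

# Print one row per stem
for stem in sorted(stem_leaf):
    leaves_text = ' '.join(str(leaf) for leaf in stem_leaf[stem])
    print(f"{stem} | {leaves_text}")
```

`divmod(value, 10)` assumes non-negative integers. Other data would need a different split rule.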


### [12] Time series plot

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Create a date range
dates = pd.date_range(start='2023-01-01', periods=50, freq='D')

# Step 2: Generate random data to simulate time series data
data = np.random.randn(50).cumsum()

# Step 3: Create a DataFrame
df = pd.DataFrame(data, index=dates, columns=['Value'])

# Step 4: Plot the time series data
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['Value'], marker='o', color='blue')
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()


### [13] Frequency polygon: a graphical representation used to visualize the distribution of a dataset.

**Interpretation:**

+ Peaks: Peaks in a frequency polygon represent the most frequent data points (the mode). A distribution can have one peak (unimodal), two peaks (bimodal), or multiple peaks.

+ Spread: The width of the polygon indicates the spread or variance of the data. A wider polygon means a greater spread.

+ Skewness: The asymmetry of the polygon indicates skewness. A longer tail on the right suggests positive (right) skew; a longer tail on the left suggests negative (left) skew.

**Uses**
+ Comparing Distributions: Frequency polygons are particularly useful when comparing multiple distributions. Different polygons can be plotted on the same axes for comparison.

+ Identifying Trends: They are helpful in identifying trends and patterns that might not be obvious in tabulated data.



In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Generate a dataset (you can replace this with your own dataset)
data = np.random.normal(loc=0, scale=1, size=1000)

# Step 2: Compute the histogram (frequency distribution) of the data
counts, bin_edges = np.histogram(data, bins=10, density=True)

# Step 3: Compute the bin centers (instead of edges)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

# Step 4: Plot the frequency polygon
plt.figure(figsize=(8, 6))
plt.plot(bin_centers, counts, marker='o', linestyle='-')
plt.title('Frequency Polygon')
plt.xlabel('Value')
plt.ylabel('Density')  # density=True above, so the y-axis shows density, not raw counts
plt.grid(True)
plt.show()
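
To illustrate the comparison use case, the sketch below overlays two frequency polygons on the same axes. The two groups, the seed, and the choice of 15 bins are assumptions for illustration. Both groups share one set of bin edges so that the polygons line up point for point.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two hypothetical groups, seeded for reproducibility
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0, scale=1, size=1000)
group_b = rng.normal(loc=1, scale=1.5, size=1000)

# Compute one set of bin edges from the combined data
combined = np.concatenate([group_a, group_b])
shared_edges = np.histogram_bin_edges(combined, bins=15)
centers = 0.5 * (shared_edges[:-1] + shared_edges[1:])

# Raw counts per bin for each group
counts_a, _ = np.histogram(group_a, bins=shared_edges)
counts_b, _ = np.histogram(group_b, bins=shared_edges)

plt.figure(figsize=(8, 6))
plt.plot(centers, counts_a, marker='o', linestyle='-', label='Group A')
plt.plot(centers, counts_b, marker='s', linestyle='-', label='Group B')
plt.title('Frequency Polygons of Two Groups')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.show()
```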


### [14] Dot plot

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Example dataset
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# Create a figure and a set of subplots
fig, ax = plt.subplots()

# Count the occurrence of each item in the dataset
values, counts = np.unique(data, return_counts=True)

# Plot each data point
for value, count in zip(values, counts):
    ax.plot([value] * count, range(1, count + 1), 'o', color='blue')  # Stack dots from 1 up to the count

# Set labels and title
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.set_title('Dot Plot')

# Show the plot
plt.show()


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Generate a random dataset of English grades for 30 students
np.random.seed(0)  # For reproducibility
grades = np.random.randint(56, 99, size=30)  # Grades between 56 and 98

# Step 2: Sort the grades in decreasing order
sorted_indices = np.argsort(-grades)  # Indices of students after sorting grades in decreasing order

# Step 3: Create a dot plot
fig, ax = plt.subplots()
# Plotting students' numbers on the x-axis against their sorted grades
ax.plot(range(1, len(sorted_indices) + 1), grades[sorted_indices], 'o', color='blue')

# Set the labels and title
ax.set_xlabel('Students (sorted by grade)')
ax.set_ylabel('Grades')
ax.set_title('Dot Plot of Students\' English Grades')

# Set x-axis ticks
x_ticks = list(range(1, len(sorted_indices) + 1, 5))
if len(sorted_indices) not in x_ticks:  # Ensure the last number is included
    x_ticks.append(len(sorted_indices))
ax.set_xticks(x_ticks)

# Set y-axis limits
ax.set_ylim(0, 100)

# Show the plot
plt.show()


## 😘 **2.4 Real data for practice**

### 1. Histogram, Density Graph, Normality Test

In [None]:
import numpy as np

# Generate normally distributed heights of 500 individuals
heights = np.random.normal(170, 10, 500)  # mean=170cm, std=10cm


### 2. Box Plot

In [None]:
import numpy as np

# Exam scores in 4 different subjects for 50 students
scores_math = np.random.randint(50, 100, 50)
scores_science = np.random.randint(55, 100, 50)
scores_history = np.random.randint(40, 100, 50)
scores_english = np.random.randint(60, 100, 50)


### 3. Scatter Plot

In [None]:
import numpy as np

# Height and Weight of 100 individuals
height = np.random.normal(170, 10, 100)  # Height in cm
weight = height * 0.5 + np.random.normal(0, 5, 100)  # Weight in kg


### 4. Bar Plot

In [None]:
import numpy as np

# Average monthly sales data for a year
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sales = np.random.randint(1000, 5000, 12)  # Sales in units


### 5. Pie Chart

In [None]:
import numpy as np

# Market share of 5 companies
companies = ['Company A', 'Company B', 'Company C', 'Company D', 'Company E']
market_share = np.random.rand(5)
market_share /= market_share.sum()


### 6. Line Graph

In [None]:
import numpy as np

# Yearly average temperature for 10 years
years = np.arange(2010, 2020)
temperature = np.random.uniform(14, 20, len(years))  # Temperature in Celsius


### 7. Radar Chart

In [None]:
import numpy as np

# Performance metrics for 3 employees in 5 areas
labels=np.array(['Efficiency', 'Quality', 'Commitment', 'Responsibility', 'Teamwork'])
performance_A = np.random.randint(1, 5, 5)
performance_B = np.random.randint(1, 5, 5)
performance_C = np.random.randint(1, 5, 5)


### 8. Heatmap

In [None]:
import numpy as np

# Correlation matrix for 6 variables
correlation_matrix = np.random.uniform(-1, 1, (6, 6))
correlation_matrix = (correlation_matrix + correlation_matrix.T) / 2  # Making it symmetric
np.fill_diagonal(correlation_matrix, 1)  # Fill diagonal with 1s for correlation


### 9. Violin Plot

In [None]:
import numpy as np

# Three different groups of data
group1 = np.random.normal(20, 5, 100)
group2 = np.random.normal(30, 10, 100)
group3 = np.random.normal(40, 15, 100)


### 10. Pareto Plot

In [None]:
import numpy as np

# Complaint types and their frequencies
complaint_types = ['Type A', 'Type B', 'Type C', 'Type D', 'Type E']
frequencies = np.random.randint(10, 100, len(complaint_types))


### 11. Stem-and-Leaf Plot

In [None]:
import numpy as np

# Random integers representing some measurement
measurements = np.random.randint(10, 99, 50)


### 12. Time Series Plot

In [None]:
import numpy as np
import pandas as pd

# Daily stock prices for a month
dates = pd.date_range(start='2023-01-01', end='2023-01-31')
stock_prices = np.random.uniform(low=100, high=200, size=len(dates))


### 13. Frequency Polygon

In [None]:
import numpy as np

# Test scores for a group of students
test_scores = np.random.randint(0, 100, 100)


# The END