
In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Example tweet
tweet = "This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies."

# Create the vectorizer
vectorizer = CountVectorizer()

# Fit and transform the tweet (it needs to be inside a list)
word_counts = vectorizer.fit_transform([tweet])

# Get the words and their counts
words = vectorizer.get_feature_names_out()
counts = word_counts.toarray()[0]

# Display results
for word, count in zip(words, counts):
    print(f"{word}: {count}")

afloat: 1
are: 1
crescent: 1
dawn: 1
early: 1
east: 1
in: 1
joins: 1
look: 1
moon: 1
morn: 1
rise: 1
saturn: 1
skies: 1
the: 2
then: 1
this: 1
to: 1
venus: 1
wednesday: 1
you: 1
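
The counts above can be reproduced without scikit-learn, which helps show what `CountVectorizer` does with its default settings: it lowercases the text and keeps only tokens of two or more word characters, so punctuation and the lone `&` are dropped. The following is a minimal sketch using only the standard library. The helper name `count_words` is hypothetical, and the regular expression is an assumption meant to approximate the vectorizer's default token pattern, not a full reimplementation.

```python
import re
from collections import Counter

tweet = "This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies."


def count_words(text):
    """Count lowercase tokens made of two or more word characters."""
    # Assumed approximation of CountVectorizer's default tokenization
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


counts = count_words(tweet)

# Print the words alphabetically, as the vectorizer's vocabulary is ordered
for word in sorted(counts):
    print(f"{word}: {counts[word]}")
```

Only `the` appears twice, because `The` and `the` are merged once the text is lowercased.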


## Relative length of text

The length of a text is the number of characters in its string. Here we measure the same tweet used above.


In [None]:
tweet = "This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies."

In [None]:
len(tweet)

125

Analysts have found that the average tweet is about 30 characters long. We can therefore derive a new characteristic, called relative length, which is the length of the tweet divided by the average length:

In [None]:
relative_len = len(tweet)/30.0
print(relative_len)

4.166666666666667
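
The same calculation can be wrapped in a small function so that it applies to any text. This is only a sketch: `relative_length` and its `average_length` parameter are hypothetical names, and the default of 30 characters is simply the average quoted above.

```python
def relative_length(text, average_length=30.0):
    """Return len(text) divided by an assumed average text length."""
    if average_length <= 0:
        raise ValueError("average_length must be positive")
    return len(text) / average_length


tweet = "This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies."
print(relative_length(tweet))
```

A value above 1 means the text is longer than average; this tweet is a little over four times the assumed average length.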


## Example – world alcohol consumption data

In [None]:
import pandas as pd

# read in the CSV file from a URL
drinks = pd.read_csv('https://raw.githubusercontent.com/RajenderDBC/MathDataSci_dataset/refs/heads/main/drinks.csv')

# examine the first five rows of the data
drinks.head()

| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
| 1 | Albania | 89 | 132 | 54 | 4.9 | EU |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
| 4 | Angola | 217 | 57 | 45 | 5.9 | AF |


In [None]:
# Summarize the categorical column "continent" (a Series, not a DataFrame)

drinks['continent'].describe()

| | continent |
|---|---|
| count | 193 |
| unique | 6 |
| top | AF |
| freq | 53 |


In [None]:
drinks['beer_servings'].describe()

| | beer_servings |
|---|---|
| count | 193.0 |
| mean | 106.160622 |
| std | 101.143103 |
| min | 0.0 |
| 25% | 20.0 |
| 50% | 76.0 |
| 75% | 188.0 |
| max | 376.0 |
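
To make the numeric summary less opaque, the sketch below recomputes the same kinds of statistics by hand with the standard library. The sample is just the five `beer_servings` values visible in the `head()` output, not the full column, so the numbers will not match the `describe()` result. Two details are worth keeping in mind: pandas reports the sample standard deviation (dividing by N - 1), and its quartiles use linear interpolation by default, which `statistics.quantiles(..., method="inclusive")` is assumed here to reproduce.

```python
import statistics

# First five beer_servings values from the head() output (not the full column)
beer = [0, 89, 25, 245, 217]

summary = {
    "count": len(beer),
    "mean": statistics.mean(beer),
    "std": statistics.stdev(beer),  # sample standard deviation (N - 1)
    "min": min(beer),
}

# Quartiles with linear interpolation between sorted data points
q1, q2, q3 = statistics.quantiles(beer, n=4, method="inclusive")
summary.update({"25%": q1, "50%": q2, "75%": q3, "max": max(beer)})

for name, value in summary.items():
    print(f"{name}: {value}")
```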


## Median Example

In [None]:
import numpy as np

results = [5, 4, 3, 4, 5, 3, 2, 5, 3, 2, 1, 4, 5, 3, 4, 4, 5, 4, 2, 1, 4, 5, 4, 3, 2, 4, 4, 5, 4, 3, 2, 1]

sorted_results = sorted(results)

In [None]:
print(sorted_results)

[1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]


In [None]:
print(np.mean(results)) # == 3.4375

3.4375


In [None]:
print(np.median(results)) # == 4.0

4.0
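
The median can also be read directly from the sorted list. With 32 values there is no single middle element, so the median is the average of the 16th and 17th sorted values, both of which are 4. The sketch below spells out that rule; `manual_median` is a hypothetical helper written for illustration, not part of NumPy.

```python
def manual_median(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle element exists
        return float(ordered[mid])
    # Even count: average the two elements around the midpoint
    return (ordered[mid - 1] + ordered[mid]) / 2


results = [5, 4, 3, 4, 5, 3, 2, 5, 3, 2, 1, 4, 5, 3, 4, 4, 5, 4, 2, 1, 4, 5, 4, 3, 2, 4, 4, 5, 4, 3, 2, 1]
print(manual_median(results))
```

The mean (3.4375) sits below the median (4.0) here because the low scores pull the average down while leaving the middle of the sorted list unchanged.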


## Mean, Variance, and Standard Deviation



In [None]:
import numpy as np
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]
mean = np.mean(temps)
print("Mean:", mean)

Mean: 30.733333333333334


In [None]:
# Differences from mean
diffs = [x - mean for x in temps]
print("Differences from mean:", diffs)



Differences from mean: [np.float64(0.2666666666666657), np.float64(1.2666666666666657), np.float64(1.2666666666666657), np.float64(0.2666666666666657), np.float64(-2.7333333333333343), np.float64(-1.7333333333333343), np.float64(0.2666666666666657), np.float64(7.266666666666666), np.float64(1.2666666666666657), np.float64(0.2666666666666657), np.float64(-0.7333333333333343), np.float64(-1.7333333333333343), np.float64(-0.7333333333333343), np.float64(0.2666666666666657), np.float64(-4.733333333333334)]


In [None]:
# Squared differences
squared_diffs = [d**2 for d in diffs]
print("Squared differences:", squared_diffs)

Squared differences: [np.float64(0.07111111111111061), np.float64(1.6044444444444421), np.float64(1.6044444444444421), np.float64(0.07111111111111061), np.float64(7.471111111111116), np.float64(3.004444444444448), np.float64(0.07111111111111061), np.float64(52.80444444444443), np.float64(1.6044444444444421), np.float64(0.07111111111111061), np.float64(0.5377777777777791), np.float64(3.004444444444448), np.float64(0.5377777777777791), np.float64(0.07111111111111061), np.float64(22.404444444444454)]


In [None]:
# Variance (Population formula: divide by N)
variance = sum(squared_diffs) / len(temps)
print("Variance:", variance)

Variance: 6.328888888888887


In [None]:
# Standard deviation (Population)
std_dev = variance**0.5
print("Standard deviation:", std_dev)

Standard deviation: 2.5157283018817607
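
The steps above use the population formulas, which divide by N. When the data are a sample from a larger population, the usual alternative divides by N - 1 instead. As a cross-check, the sketch below computes both versions with the standard library `statistics` module, swapped in here for the manual calculation; the variable names are illustrative.

```python
import statistics

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Population formulas: divide the sum of squared differences by N
pop_var = statistics.pvariance(temps)
pop_std = statistics.pstdev(temps)

# Sample formulas: divide by N - 1
sample_var = statistics.variance(temps)
sample_std = statistics.stdev(temps)

print("Population variance:", pop_var)
print("Population standard deviation:", pop_std)
print("Sample variance:", sample_var)
print("Sample standard deviation:", sample_std)
```

The population values should agree with the manual results above, and the sample values are slightly larger because the divisor is smaller.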


## Geometric Mean

In [None]:
import numpy as np

# Data
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Number of items
n = len(temps)

# Calculate the product of all temperatures
product = 1
for temp in temps:
    product *= temp

# Geometric mean formula: (product of values)^(1/n)
geometric_mean = product ** (1.0 / n)

# Output
print("Geometric Mean:", geometric_mean)

Geometric Mean: 30.63473484374659
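
Multiplying all values together works for a short list, but the product grows quickly and can become too large to convert to a float when the list is long. An equivalent formulation averages the logarithms and exponentiates the result. The sketch below shows that variant alongside `statistics.geometric_mean` (available from Python 3.8); both assume every value is strictly positive, and the variable names are illustrative.

```python
import math
import statistics

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Geometric mean as exp(mean of logs); avoids forming a huge product
log_mean = sum(math.log(t) for t in temps) / len(temps)
geometric_mean_logs = math.exp(log_mean)

# Standard library equivalent
geometric_mean_stdlib = statistics.geometric_mean(temps)

print("Geometric mean (logs):", geometric_mean_logs)
print("Geometric mean (statistics):", geometric_mean_stdlib)
```

For positive data the geometric mean never exceeds the arithmetic mean, which is consistent with the two results for these temperatures (about 30.63 versus 30.73).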
