Q1: A researcher wants to determine if there is a significant difference in the mean test scores
among four different study groups (Group 1, Group 2, Group 3, and Group 4). The test scores are
recorded as follows:

Group 1: [85, 88, 82, 86, 90]

Group 2: [78, 80, 84, 76, 82]

Group 3: [92, 94, 91, 88, 96]

Group 4: [75, 78, 80, 77, 81]

Perform an ANOVA test to determine if there are significant differences in mean test scores among
the four groups at a 5% significance level.

In [None]:
# H_0: There is no significant difference in the means of the four groups    => mu1 = mu2 = mu3 = mu4
# H_a: At least one group mean differs from the others.

import scipy.stats as stats

g1 = [85, 88, 82, 86, 90]
g2 = [78, 80, 84, 76, 82]
g3 = [92, 94, 91, 88, 96]
g4 = [75, 78, 80, 77, 81]

alpha = 0.05

_, pvalue = stats.f_oneway(g1, g2, g3, g4)

print('P-value',pvalue)

if pvalue < alpha:
  print('Reject Null Hypothesis -> At least one group mean differs from the others.')
else:
  print('Fail to reject Null Hypothesis -> There is no significant difference in mean among the four groups.')



P-value 3.875161644446216e-06
Reject Null Hypothesis -> At least one group mean differs from the others.
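
As a cross-check on the one-way ANOVA above, the F statistic can be rebuilt by hand from the between-group and within-group sums of squares. This is a minimal sketch; the names `groups`, `ss_between`, `ss_within`, `f_manual` and `p_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([85, 88, 82, 86, 90]),
    np.array([78, 80, 84, 76, 82]),
    np.array([92, 94, 91, 88, 96]),
    np.array([75, 78, 80, 77, 81]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group variation: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)   # upper-tail probability

f_scipy, p_scipy = stats.f_oneway(*groups)
print('F (manual):', f_manual, '| F (scipy):', f_scipy)
print('p (manual):', p_manual, '| p (scipy):', p_scipy)
```

A significant ANOVA only says that some mean differs; identifying which groups differ would need a post-hoc comparison, which is outside this sketch.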


Q2: Find the Singular Value Decomposition (SVD) of the matrix

A = [

1 2

3 4

]

In [None]:
# Singular Value Decomposition (SVD) is a matrix factorization technique
# It decomposes a given matrix into three other matrices
# The three matrices involved in SVD are:
# U (Left Singular Vectors): U is an orthogonal matrix (U^T * U = I) that contains the left singular vectors of the original matrix.
# Σ (Diagonal Singular Values): Σ is a diagonal matrix containing the singular values of the original matrix. These singular values are non-negative and represent the "strength" or importance of each singular vector.
# V^T (Right Singular Vectors): V^T is the transpose of an orthogonal matrix V, containing the right singular vectors of the original matrix.
# Mathematically, the SVD of a matrix A is represented as:
# A = U * Σ * V^T

In [None]:
from numpy import linalg
import numpy as np
A = np.array([[1,2],[3,4]])
linalg.svd(A)

(array([[-0.40455358, -0.9145143 ],
        [-0.9145143 ,  0.40455358]]),
 array([5.4649857 , 0.36596619]),
 array([[-0.57604844, -0.81741556],
        [ 0.81741556, -0.57604844]]))
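
To check the factorization, the three factors can be multiplied back together. A small sketch, assuming the same matrix as above; `Sigma` and `A_rebuilt` are illustrative names. Note that `np.linalg.svd` returns the singular values as a 1-D array and the third factor already transposed (V^T).

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
U, s, Vt = np.linalg.svd(A)

Sigma = np.diag(s)             # place the singular values on a diagonal matrix
A_rebuilt = U @ Sigma @ Vt     # A = U * Sigma * V^T

print('Reconstruction matches A :', np.allclose(A_rebuilt, A))
print('U is orthogonal          :', np.allclose(U.T @ U, np.eye(2)))
print('V is orthogonal          :', np.allclose(Vt @ Vt.T, np.eye(2)))
```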

Q3: You are conducting a survey to collect data on the number of cars owned by households in a
neighborhood. The data ranges from 0 to 4 cars. Determine the type of data and calculate the
median.

The data is as follows:

0, 2, 4, 2, 0, 1, 2, 3, 2, 4, 2, 3, 1, 3, 4, 1, 4, 3

In [None]:
# Type of data --> Quantitative, discrete (the number of cars is a count)

data = [0, 2, 4, 2, 0, 1, 2, 3, 2, 4, 2, 3, 1, 3, 4, 1, 4, 3]
np.median(data)

2.0

Q4: A manufacturer claims that their machine produces widgets with a mean weight of 500 grams.
You take a random sample of 36 widgets and find that the mean weight is 495 grams with a standard
deviation of 20 grams. Conduct a hypothesis test at a 1% significance level to determine if the
manufacturer's claim is accurate.

In [None]:
# H_0: The mean weight of the widgets is 500 grams
# H_a: The mean weight of the widgets is not 500 grams

population_mean = 500
sample_size = 36
sample_mean = 495
sample_std = 20
alpha = 0.01

z = (sample_mean - population_mean) / (sample_std/np.sqrt(sample_size))
# Two-tailed p-value: the claim is "mean = 500", so a deviation in either direction counts against it
pvalue = 2 * (1 - stats.norm.cdf(abs(z)))

print('test-statistic(z) :', z)
print('P-value :',pvalue)

if pvalue < alpha:
  print('Reject Null Hypothesis -> The mean weight of the widgets is not 500 grams')
else:
  print('Fail to reject Null Hypothesis -> No significant evidence that the mean weight differs from 500 grams')


test-statistic(z) : -1.5
P-value : 0.13361440253771617
Fail to reject Null Hypothesis -> No significant evidence that the mean weight differs from 500 grams


Q5: In a factory, 80% of the products pass quality control, while 20% are rejected. Of the products
that pass quality control, 90% are shipped on time, while 10% are delayed. If a randomly chosen
product is delayed, what is the probability that it also failed quality control?

In [None]:
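# The problem does not state P(delayed | failed QC). ASSUMPTION: every rejected product is delayed.
P_pass, P_fail = 0.80, 0.20
P_delayed_given_pass = 0.10
P_delayed_given_fail = 1.0   # assumed, not given in the problem

# Bayes' theorem: P(fail | delayed) = P(delayed | fail) * P(fail) / P(delayed)
P_delayed = P_delayed_given_fail * P_fail + P_delayed_given_pass * P_pass
P_fail_given_delayed = P_delayed_given_fail * P_fail / P_delayed
print("Probability of having failed QC given that it is delayed:", round(P_fail_given_delayed, 4))

Probability of having failed QC given that it is delayed: 0.7143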


Q6: A survey records the monthly incomes (in thousands of dollars) of a group of individuals.

Calculate the mean, median, and mode for the income data with the following
class intervals:

Income Range
(in thousands)  

10-20 20-30 30-40 40-50 50-60

Frequency

15 20 12 8 5

In [None]:
from statistics import mean, median, mode

Income_range = ['10-20', '20-30', '30-40', '40-50', '50-60']
freq = [15, 20, 12, 8, 5]

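# Represent each class by its midpoint (an approximation, since the raw values are not available)
midpoint = [(int(k.split('-')[0]) + int(k.split('-')[1])) / 2 for k in Income_range]

# Expand to one midpoint per observation so the statistics functions can be applied
data = []
for i in range(len(Income_range)):
    data.extend([midpoint[i]] * freq[i])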

print('Mean :', mean(data))
print('Median :', median(data))
print('Mode :', mode(data))

Mean : 29.666666666666668
Median : 25.0
Mode : 25.0
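
The midpoint expansion above can only return a class midpoint as the median or mode. Textbooks usually interpolate inside the median and modal classes instead. The sketch below applies those standard grouped-data formulas to the same table; it is an alternative estimate, not the method used in the cell above, and `lower`, `h`, `median_grouped` and `mode_grouped` are illustrative names.

```python
lower = [10, 20, 30, 40, 50]   # lower boundary of each class
freq = [15, 20, 12, 8, 5]
h = 10                         # class width
n = sum(freq)

# Median = L + ((n/2 - cf) / f) * h, where cf is the cumulative frequency
# before the median class and f is the frequency of the median class
cum = 0
for i, f in enumerate(freq):
    if cum + f >= n / 2:
        median_grouped = lower[i] + (n / 2 - cum) / f * h
        break
    cum += f

# Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h, where f1 is the modal class
# frequency and f0, f2 are the frequencies of the neighbouring classes
m = freq.index(max(freq))
f1 = freq[m]
f0 = freq[m - 1] if m > 0 else 0
f2 = freq[m + 1] if m < len(freq) - 1 else 0
mode_grouped = lower[m] + (f1 - f0) / (2 * f1 - f0 - f2) * h

print('Median (interpolated) :', median_grouped)
print('Mode (interpolated)   :', round(mode_grouped, 4))
```

The mean is unaffected by this choice: the frequency-weighted average of the midpoints is the usual grouped-data mean.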


Q7: A university claims that the average time it takes for a student to complete a specific test is 40
minutes. A random sample of 30 students takes the test, and the sample mean time is found to be 38
minutes with a sample standard deviation of 5 minutes. Test the university's claim at a 1%
significance level.

In [None]:
# H_0: the average time it takes a student to complete the test is 40 minutes or more.
# H_a: the average time it takes a student to complete the test is less than 40 minutes.

population_mean = 40
sample_size = 30
sample_mean = 38
sample_std = 5
alpha = 0.01

z = (sample_mean - population_mean) / (sample_std/np.sqrt(sample_size))
pvalue = stats.norm.cdf(z)          # left-tailed p-value P(Z <= z), because H_a says the mean is less than the claimed value

print('Test-statistic :',z)
print('P-value :', round(pvalue, 4))

if pvalue < alpha:
  print('Reject Null Hypothesis -> average time it takes for a student to complete a specific test is less than 40 minutes.')
else:
  print('Fail to Reject Null Hypothesis -> no significant evidence that the average time is less than 40 minutes.')

Test-statistic : -2.1908902300206643
P-value : 0.0142
Fail to Reject Null Hypothesis -> no significant evidence that the average time is less than 40 minutes.


Q8: Calculate Pearson's correlation coefficient for the following data pairs with continuous class
intervals:

Height (cm)

150-160 160-170 170-180 180-190 190-200

Weight (kg)

50 55 60 65 70

In [None]:
Height = ['150-160', '160-170', '170-180', '180-190', '190-200']
weight = [50, 55, 60, 65, 70]

midpoint_h = [((int(h.split('-')[0]))+(int(h.split('-')[1])))/2 for h in Height]

print('correlation coefficient for the following data pairs :', np.corrcoef(midpoint_h, weight)[0,1])

correlation coefficient for the following data pairs : 1.0
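
For reference, the same coefficient can be computed directly from the definition r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²). A minimal sketch using the class midpoints as x; `height_mid`, `dx`, `dy` and `r_manual` are illustrative names.

```python
import numpy as np

height_mid = np.array([155.0, 165.0, 175.0, 185.0, 195.0])   # class midpoints
weight = np.array([50.0, 55.0, 60.0, 65.0, 70.0])

dx = height_mid - height_mid.mean()   # deviations from the mean height
dy = weight - weight.mean()           # deviations from the mean weight

r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print('r (manual) :', r_manual)
print('Agrees with np.corrcoef :', np.isclose(r_manual, np.corrcoef(height_mid, weight)[0, 1]))
```

The result is a perfect correlation because weight rises by exactly 5 kg for every 10 cm step in the midpoints, so the points lie on a straight line.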


Q9: Find the eigenvectors of the following 3x3 matrix:

A = [

2 1 0

1 3 1

0 1 2

]

In [None]:
a = np.array([[2, 1, 0],[1, 3, 1],[0, 1, 2]])
value, vector = np.linalg.eig(a)

print('Eigenvalues :\n', value)
print('\nEigenvectors :\n', vector)

Eigenvalues :
 [4. 2. 1.]

Eigenvectors :
 [[-4.08248290e-01  7.07106781e-01  5.77350269e-01]
 [-8.16496581e-01 -3.45742585e-16 -5.77350269e-01]
 [-4.08248290e-01 -7.07106781e-01  5.77350269e-01]]
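
`np.linalg.eig` returns the eigenvectors as the columns of the second array, not the rows. A short sketch that checks the defining relation A v = λ v for each column (variable names here are illustrative):

```python
import numpy as np

a = np.array([[2, 1, 0], [1, 3, 1], [0, 1, 2]])
values, vectors = np.linalg.eig(a)

for i in range(len(values)):
    v = vectors[:, i]                              # i-th eigenvector is the i-th COLUMN
    ok = np.allclose(a @ v, values[i] * v)         # A v should equal lambda v
    print('lambda =', round(float(values[i]), 6), '| A v == lambda v :', ok)
```

The tiny entry `-3.45742585e-16` in the output above is floating-point noise for an exact zero.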


Q10: The following data represents the scores of a group of students in a math test:

72, 85, 90, 68, 78, 92, 88, 76, 80, 84.

Calculate the interquartile range (IQR) for this dataset.

In [None]:
scores = [72, 85, 90, 68, 78, 92, 88, 76, 80, 84]

q3 = np.percentile(scores, 75)
q1 = np.percentile(scores, 25)

IQR = q3 - q1

print('IQR :',IQR)

IQR : 10.75


Q1: Find the eigenvectors of the following 3x3 matrix:

A = [

2 1 1

1 2 1

1 1 2

]

In [None]:
arr = np.array([[2,1,1],[1,2,1],[1,1,2]])
value, vector = np.linalg.eig(arr)

print('Eigen Value :\n', value)
print('\nEigen Vector :\n', vector)

Eigen Value :
 [1. 4. 1.]

Eigen Vector :
 [[-0.81649658  0.57735027 -0.32444284]
 [ 0.40824829  0.57735027 -0.48666426]
 [ 0.40824829  0.57735027  0.81110711]]
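
Because this matrix is symmetric and the eigenvalue 1 is repeated, any basis of its two-dimensional eigenspace is a valid answer, and the two vectors returned by `eig` for λ = 1 need not be orthogonal to each other. As an alternative sketch (not the call used above), `np.linalg.eigh` is designed for symmetric matrices and returns eigenvalues in ascending order with orthonormal eigenvectors:

```python
import numpy as np

arr = np.array([[2, 1, 1], [1, 2, 1], [1, 1, 2]])
values, vectors = np.linalg.eigh(arr)   # for symmetric / Hermitian matrices

print('Eigenvalues :', values)
print('Columns are orthonormal :', np.allclose(vectors.T @ vectors, np.eye(3)))
print('A V == V diag(lambda)   :', np.allclose(arr @ vectors, vectors @ np.diag(values)))
```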


Q2: Calculate Pearson's correlation coefficient for the following data pairs with continuous class
intervals:

Age (years)

20-30 30-40 40-50 50-60 60-70

Income(rupees)

25000 35000 45000 55000 65000

In [None]:
age = ['20-30', '30-40', '40-50', '50-60', '60-70']
income = [25000, 35000, 45000, 55000, 65000]

midpoint_age = [((int(a.split('-')[0])) + (int(a.split('-')[1])))/ 2 for a in age]

print('Correlation coefficient for the following data pairs :', np.corrcoef(midpoint_age, income)[0,1])

Correlation coefficient for the following data pairs : 1.0


Q3: A manufacturer claims that the average lifespan of its LED bulbs is 10,000 hours. A sample of 25
bulbs is tested, and the sample mean lifespan is found to be 9,800 hours with a sample standard
deviation of 200 hours. Test the manufacturer's claim at a 5% significance level.

In [None]:
# H_o = average lifespan of LED bulbs is 10,000 hours
# H_a = average lifespan of LED bulbs is not 10,000 hours


population_mean = 10000
sample_size = 25
sample_mean = 9800
sample_std = 200
alpha = 0.05

z = (sample_mean - population_mean) / (sample_std/np.sqrt(sample_size))
pval = 2*(1 - stats.norm.cdf(abs(z)))

print('Test-statistic :', z)
print('P-value :', pval)

if pval < alpha:
  print('Reject Null Hypothesis -> average lifespan of LED bulbs is not 10,000 hours')
else:
  print('Fail to Reject Null Hypothesis -> no significant evidence that the average lifespan differs from 10,000 hours')



Test-statistic : -5.0
P-value : 5.733031438470704e-07
Reject Null Hypothesis -> average lifespan of LED bulbs is not 10,000 hours
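
With only 25 bulbs and a standard deviation estimated from the sample, a one-sample t-test is the more conventional choice than the z-test used above. The sketch below shows that alternative with the same summary statistics; `t_stat`, `df` and `p_t` are illustrative names, and the normality of bulb lifetimes is an assumption.

```python
import numpy as np
from scipy import stats

population_mean = 10000
sample_size = 25
sample_mean = 9800
sample_std = 200
alpha = 0.05

# Same standardized statistic, but referred to a t distribution with n - 1 degrees of freedom
t_stat = (sample_mean - population_mean) / (sample_std / np.sqrt(sample_size))
df = sample_size - 1
p_t = 2 * stats.t.sf(abs(t_stat), df)    # two-tailed p-value

print('t-statistic :', t_stat)
print('P-value (t) :', p_t)
print('Reject H_0' if p_t < alpha else 'Fail to reject H_0')
```

The t distribution has heavier tails, so its p-value is larger than the z-based one, but the conclusion at the 5% level is the same here.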


Q4: A dataset contains the temperatures (in degrees Celsius) recorded at different times of the day.
Calculate the mean, median, and mode for the temperature data with the following class intervals:

Temperature
(°C)

10-20 20-30 30-40 40-50 50-60

Frequency

4 8 12 10 6

In [None]:
from statistics import mean, median, mode

temp = ['10-20', '20-30', '30-40', '40-50', '50-60']
freq = [4, 8, 12, 10, 6]

midpoint_t = [((int(t.split('-')[0]))+(int(t.split('-')[1])))/2 for t in temp]

data_ft = []
for i in range(len(temp)):
    data_ft.extend([midpoint_t[i]] * freq[i])

print('Mean :', mean(data_ft))
print('Median :', median(data_ft))
print('Mode :', mode(data_ft))

Mean : 36.5
Median : 35.0
Mode : 35.0


Q5: In a survey of a group of people, 60% own a car, and 40% own a bicycle. If 30% of the people
own both a car and a bicycle, what is the probability that a randomly chosen person owns a bicycle
given that they own a car?

In [None]:
p_car = 0.6
p_bicycle = 0.4
p_car_bicycle = 0.3
p_bicycle_given_car = p_car_bicycle / p_car
print('Probability that a randomly chosen person owns a bicycle given that they own a car :',p_bicycle_given_car)

Probability that a randomly chosen person owns a bicycle given that they own a car : 0.5


Q6: A university claims that the average score of their students on a standardized test is 1150. You
randomly select 25 students and find that their average score is 1125 with a standard deviation of
100. Conduct a hypothesis test at a 5% significance level to determine if the university's claim is
accurate.

In [None]:
# H_o = average score of students on a standardized test is 1150
# H_a = average score of students on a standardized test is not 1150

pop_mean = 1150
sample_size = 25
sample_mean = 1125
sample_std = 100
alpha = 0.05

z = (sample_mean - pop_mean)/(sample_std/np.sqrt(sample_size))
pval = 2*(1- stats.norm.cdf(abs(z)))

print('Test-statistic :', z)
print('P-value :', pval)

if pval < alpha:
  print('Reject Null Hypothesis ->  average score of students on a standardized test is not 1150')
else:
  print('Fail to Reject Null Hypothesis -> no significant evidence that the average score differs from 1150')

Test-statistic : -1.25
P-value : 0.2112995473337107
Fail to Reject Null Hypothesis -> no significant evidence that the average score differs from 1150


Q7: A survey collects data on the highest level of education achieved by 60 participants. The
categories are "High School," "Bachelor's Degree," "Master's Degree," and "Ph.D." Categorize the
data and determine the type of data. Calculate the mode.

The data is as follows:

High School, Bachelor's Degree, Master's Degree, Master's Degree, Ph.D., Bachelor's Degree,
Master's Degree, Bachelor's Degree, Bachelor's Degree, High School, Bachelor's Degree, High School,
Master's Degree, Master's Degree, Bachelor's Degree, Master's Degree, Bachelor's Degree,
Bachelor's Degree, Ph.D., Master's Degree, Bachelor's Degree, Master's Degree, Ph.D., Ph.D.,
Bachelor's Degree, Master's Degree, Bachelor's Degree, Ph.D., Bachelor's Degree, Master's Degree,
Bachelor's Degree, Master's Degree, High School, High School, Master's Degree, Ph.D., Ph.D., Ph.D.,
High School, Master's Degree, Master's Degree, Bachelor's Degree, Master's Degree, Bachelor's
Degree, High School, Ph.D., High School, High School, Master's Degree, Ph.D., Bachelor's Degree,
Ph.D., Ph.D.

In [None]:
# The data is categorical (qualitative). Because the levels have a natural order, it is ordinal data.

import pandas as pd

ed_degree = "High School, Bachelor's Degree, Master's Degree, Master's Degree, Ph.D., Bachelor's Degree, Master's Degree, Bachelor's Degree, Bachelor's Degree, High School, Bachelor's Degree, High School, Master's Degree, Master's Degree, Bachelor's Degree, Master's Degree, Bachelor's Degree, Bachelor's Degree, Ph.D., Master's Degree, Bachelor's Degree, Master's Degree, Ph.D., Ph.D., Bachelor's Degree, Master's Degree, Bachelor's Degree, Ph.D., Bachelor's Degree, Master's Degree, Bachelor's Degree, Master's Degree, High School, High School, Master's Degree, Ph.D., Ph.D., Ph.D., High School, Master's Degree, Master's Degree, Bachelor's Degree, Master's Degree, Bachelor's Degree, High School, Ph.D., High School, High School, Master's Degree, Ph.D., Bachelor's Degree, Ph.D., Ph.D."
ed_degree = pd.Series(ed_degree.split(', '))

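counts = ed_degree.value_counts()
# Keep every category tied for the highest count instead of hard-coding how many modes there are
print("Mode :\n",dict(counts[counts == counts.max()]))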

Mode :
 {"Bachelor's Degree": 16, "Master's Degree": 16}


Q8: Find the LU decomposition of the matrix

A = [

2 1

3 4

].

In [None]:
# LU decomposition, also known as LU factorization,
# decomposes a square matrix into the product of a lower triangular matrix (L) and an upper triangular matrix (U), often with a permutation matrix (P).
# Permutation Matrix (P) ==>  An identity matrix with its rows reordered.
# Lower Triangular Matrix (L) ==>  A square matrix in which all entries above the main diagonal are zero.
# Upper Triangular Matrix (U) ==>  A square matrix in which all entries below the main diagonal are zero.
# A = PLU

from scipy.linalg import lu

A = np.array([[2, 1], [3, 4]])
P, L, U = lu(A)

print("Permutation_Matrix_P :\n", P)
print("\nLower_Triangular_Matrix_L :\n", L)
print("\nUpper_Triangular_Matrix_U :\n", U)


Permutation_Matrix_P :
 [[0. 1.]
 [1. 0.]]

Lower_Triangular_Matrix_L :
 [[1.         0.        ]
 [0.66666667 1.        ]]

Upper_Triangular_Matrix_U :
 [[ 3.          4.        ]
 [ 0.         -1.66666667]]
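
The factors can be checked by multiplying them back together. A small sketch, assuming the same matrix as above. The rows are swapped by P because partial pivoting moves the larger first-column entry (3) to the top.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2, 1], [3, 4]])
P, L, U = lu(A)

print('P @ L @ U == A        :', np.allclose(P @ L @ U, A))
print('L is lower triangular :', np.allclose(L, np.tril(L)))
print('U is upper triangular :', np.allclose(U, np.triu(U)))
```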


Q9: Let Z be a continuous random variable representing the height (in inches) of adults in a
population, with a probability density function (PDF) given by f(z) = 0.02z for 60 ≤ z ≤ 75. Calculate
the probability that Z falls between 65 and 70 inches.
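
A sketch of one way to approach this question. Integrating f(z) = 0.02z uses the antiderivative F(z) = 0.01z². As stated, the density integrates to 20.25 over [60, 75] rather than 1, so it is not a valid PDF as written, and the raw integral over [65, 70] is 6.75, which cannot be a probability. The code below therefore also reports the value obtained after rescaling the density to total area 1; that rescaling is an assumption about the intended question, not something the problem states.

```python
def F(z):
    """Antiderivative of f(z) = 0.02 * z."""
    return 0.01 * z ** 2

total_area = F(75) - F(60)       # integral of f over its whole support [60, 75]
raw_integral = F(70) - F(65)     # integral of f over [65, 70]

# ASSUMPTION: rescale f so that it integrates to 1 over [60, 75]
prob_normalized = raw_integral / total_area

print('Integral over [60, 75] :', round(total_area, 4))
print('Integral over [65, 70] :', round(raw_integral, 4))
print('P(65 <= Z <= 70) after normalizing :', round(prob_normalized, 4))
```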

Q10: A manufacturer wants to determine if there is a significant difference in the mean lifetimes of
three different brands of light bulbs (Brand A, Brand B, and Brand C). The lifetimes (in hours) are
recorded as follows:

Brand A: [1500, 1600, 1550, 1620, 1580]

Brand B: [1680, 1630, 1650, 1675, 1660]

Brand C: [1470, 1490, 1450, 1485, 1505]

Perform an ANOVA test to determine if there are significant differences in mean lifetimes among the
three brands at a 2% significance level.

In [None]:
Brand_A = [1500, 1600, 1550, 1620, 1580]
Brand_B = [1680, 1630, 1650, 1675, 1660]
Brand_C = [1470, 1490, 1450, 1485, 1505]
alpha = 0.02

_, p = stats.f_oneway(Brand_A, Brand_B, Brand_C)

print('P-Value :', p)

if p < alpha:
  print('Reject null hypothesis -> There are significant differences in mean lifetimes among the three brands')
else:
  print('Fail to Reject null hypothesis -> There are no significant differences in mean lifetimes among the three brands')

P-Value : 5.263640402834659e-06
Reject null hypothesis -> There are significant differences in mean lifetimes among the three brands
