## Autoencoder in Machine Learning

An autoencoder is a type of artificial neural network used to learn efficient data representations in an unsupervised manner. It consists of two parts:

1. **Encoder**: Compresses the input data into a lower-dimensional representation.
2. **Decoder**: Reconstructs the original data from this compressed representation.

Autoencoders are commonly used for dimensionality reduction, image denoising, and anomaly detection. Training minimizes a reconstruction loss, such as the mean squared error between the input and the reconstructed output.
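To make the encoder/decoder split concrete, here is a minimal NumPy sketch of a *linear* autoencoder. It is a stand-in, not the Keras model loaded in this notebook: a purely linear autoencoder with squared-error loss is known to learn the same subspace as PCA, so the sketch fits it with an SVD instead of gradient descent. The function names (`fit_linear_autoencoder`, `encode`, `decode`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_linear_autoencoder(X, code_size):
    """Fit a linear encoder/decoder pair via SVD (equivalent to PCA)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:code_size]  # shape: (code_size, n_features)
    return mean, components

def encode(X, mean, components):
    """Encoder: project centered inputs onto the low-dimensional code."""
    return (X - mean) @ components.T

def decode(codes, mean, components):
    """Decoder: map codes back to the original feature space."""
    return codes @ components + mean

# Synthetic stand-in data (an assumption): 200 samples with 20 features
# that lie close to a 3-dimensional subspace, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

mean, components = fit_linear_autoencoder(X, code_size=3)
codes = encode(X, mean, components)
X_hat = decode(codes, mean, components)

# Per-sample reconstruction error, the quantity an autoencoder minimizes.
reconstruction_mse = np.mean((X - X_hat) ** 2, axis=1)
print(codes.shape, reconstruction_mse.mean())
```

A nonlinear autoencoder replaces the two matrix products with neural networks, but the workflow is the same: encode, decode, then measure the reconstruction error per sample.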


In [None]:
from keras.datasets import mnist
(_, _), (test_images, _) = mnist.load_data()
test_images = test_images.reshape(test_images.shape[0], -1)
test_images = test_images.astype('float32') / 255.0


In [None]:
import tensorflow as tf
autoencoder = tf.keras.models.load_model('mnist_AE.h5')
reconstructed_images = autoencoder.predict(test_images)

In [None]:
import numpy as np
from matplotlib.pyplot import imshow
import matplotlib.pyplot as plt

test1=np.array(test_images[4])
test1 = test1.reshape((28,28))

imshow(test1)

In [None]:
test1ec=np.array(reconstructed_images[4])
test1ec = test1ec.reshape((28,28))
imshow(test1ec)

In [None]:
def MSE(pixels, pixelsec):
    # Mean squared error between a flattened image and its reconstruction.
    pixels = np.asarray(pixels)
    pixelsec = np.asarray(pixelsec)
    return np.mean((pixels - pixelsec) ** 2)

# Per-image reconstruction error over the whole test set.
mse = np.array([MSE(original, reconstructed)
                for original, reconstructed in zip(test_images, reconstructed_images)])

mse

In [None]:
plt.hist(mse,bins=50,edgecolor='black')

In [None]:
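from scipy import stats
# Kolmogorov-Smirnov test against a normal distribution. The mean and standard
# deviation are estimated from the same data, so the p-value is only approximate.
ks_statistic, p_value = stats.kstest(mse, cdf='norm', args=(np.mean(mse), np.std(mse)))
print(p_value)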

## Regression and Least Squares

**Regression** is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It helps in predicting the value of the dependent variable based on the values of the independent variables.

One of the most common techniques for fitting a regression model is **Least Squares**. This method minimizes the sum of the squared differences between the observed values and the predicted values from the model. The goal is to find the best-fitting line (or curve) that reduces these differences as much as possible.
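For a straight-line model, the least-squares solution has a closed form. The symbols $a$ (slope) and $b$ (intercept) are chosen here to match the variable names in the code cells of this section; $\bar{x}$ and $\bar{y}$ denote the sample means. Minimizing the sum of squared residuals

$$
S(a, b) = \sum_{i=1}^{n} \left( y_i - b - a x_i \right)^2
$$

by setting $\partial S / \partial a = 0$ and $\partial S / \partial b = 0$ gives

$$
a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}.
$$

The fitted line is then $\hat{y} = b + a x$.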


In [None]:
import matplotlib.pyplot as plt
import numpy as np
x_value = [-2.3, -1.1, 0.5, 3.2, 4.0, 6.7, 10.3, 11.5, 20.4]
y_value = [ -9.6, -4.9, -4.1, 2.7, 5.9, 10.8, 18.9, 20.5, 31.3]
plt.scatter(x_value, y_value, color='blue')
plt.xlabel('x_value')
plt.ylabel('y_value')

In [None]:
x = x_value
y = y_value

def avg(data):
    # Arithmetic mean of a sequence (uses the length of its own argument).
    my_temp = 0
    for i in range(0, len(data)):
        my_temp = my_temp + data[i]
    return my_temp / len(data)

avg_x = avg(x)
avg_y = avg(y)

def areg():
    # Least-squares slope: sum((x - avg_x)*(y - avg_y)) / sum((x - avg_x)**2).
    a_sorat = 0    # numerator
    a_makhraj = 0  # denominator
    for i in range(0, len(x)):
        a_sorat = a_sorat + ((x[i] - avg_x) * (y[i] - avg_y))
        a_makhraj = a_makhraj + ((x[i] - avg_x) ** 2)
    return a_sorat / a_makhraj

a = areg()               # slope
b = avg_y - (a * avg_x)  # intercept

def regression(x):
    # Predicted y for a given x on the fitted line.
    return b + (a * x)

# Fitted values at the observed x positions.
y_regression = [regression(value) for value in x_value]

In [None]:
plt.scatter(x_value, y_value, color='blue')
plt.scatter(x_value, y_regression, color='red')
plt.xlabel('x_value')
plt.ylabel('y_value')
plt.plot(x_value, y_regression, color='orange')

In [None]:
rss = 0
for i in range(0, len(x_value)):
    rss=rss+(y_value[i]-regression(x_value[i]))**2

tss=0
for i in range(0, len(x_value)):
    tss=tss+(y_value[i]-np.average(y_value))**2

print(f"the R2 value is: {1-(rss/tss)}")

## Central Limit Theorem and Sampling

The **Central Limit Theorem (CLT)** states that the distribution of the suitably standardized sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the original population distribution, provided the samples are independent and identically distributed with finite variance.

In practice, **sampling** involves selecting a subset of data from a larger population to make inferences about the entire population. The CLT ensures that, with a sufficiently large sample size, the sampling distribution of the mean will be approximately normal, which allows for easier statistical analysis and hypothesis testing.
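The statement above can be checked with a short simulation. The sketch below is illustrative only: the Poisson rate of 3 mirrors the Poisson example used later in this notebook, while the sample size, the number of repetitions, and the seed are arbitrary assumptions. It draws many samples from a skewed Poisson population and compares the spread of their means with the value the CLT predicts, $\sqrt{\lambda / n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0             # Poisson rate; population mean and variance both equal lam
n = 50                # size of each sample (assumed)
num_samples = 10_000  # number of repeated samples (assumed)

# Each row is one sample of size n; each row mean is one sample mean.
samples = rng.poisson(lam, size=(num_samples, n))
sample_means = samples.mean(axis=1)

# The CLT predicts sample means that are roughly N(lam, lam / n).
expected_std = np.sqrt(lam / n)
z = (sample_means - lam) / expected_std

# For a normal distribution about 68% of values fall within one
# standard deviation of the mean.
within_one_sigma = np.mean(np.abs(z) <= 1.0)

print(f"mean of sample means: {sample_means.mean():.3f} (theory: {lam})")
print(f"std of sample means:  {sample_means.std():.3f} (theory: {expected_std:.3f})")
print(f"fraction within one sigma: {within_one_sigma:.3f}")
```

Repeating the run with a smaller `n` should show a worse match to the normal prediction, since the skew of the Poisson population is averaged out less.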


In [None]:
import pandas as pd
import numpy
df = pd.read_csv('FIFA2020.csv', encoding = "ISO-8859-1")

In [None]:
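# Fill missing 'pace' values with the mean of the two pace sub-attributes.
# Using .loc avoids chained assignment, which may silently fail to update df.
missing = df['pace'].isna()
df.loc[missing, 'pace'] = df.loc[missing, ['pace_acceleration', 'pace_sprint_speed']].mean(axis=1)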

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
np.random.seed(10)

data = np.random.choice(df['age'], 100, replace=False)

print(f"Max is: {np.max(data)}")
print(f"Min is: {np.min(data)}")
print(f"Q1 is: {np.percentile(data,25)}")
print(f"Q2 is: {np.percentile(data,50)}")
print(f"Q3 is: {np.percentile(data,75)}")
# Creating plot
plt.boxplot(data, autorange=True)
# show plot
plt.show()

In [None]:
data = np.random.choice(df['weight'], 100, replace=False)
print(f"Average is: {np.average(data)}")
print(f"Variance is: {np.var(data)}")
print(f"Standard Deviation is: {np.std(data)}")

In [None]:
import numpy as np
import matplotlib.pyplot as plt


def Q_Q_two_sample(x, y):
    """Quantile-quantile plot of two equally sized samples."""
    plt.figure()
    plt.scatter(np.sort(x), np.sort(y))
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.show()
    plt.close()



norm = np.random.normal(np.average(data), np.std(data), len(data))

Q_Q_two_sample(norm, data)

In [None]:
import scipy.stats as stats
statistic, p_value = stats.shapiro(data)
print(p_value)

In [None]:
po_data = np.random.poisson(3, 5000)

plt.hist(po_data,bins=10,edgecolor='black')

In [None]:
n = 50
po_data = np.random.poisson(3, n)
norm = np.random.normal(np.mean(po_data), np.std(po_data), n)
Q_Q_two_sample(norm, po_data)
_, p_value = stats.shapiro(po_data)
print(p_value)