x<-10.5
y<-55

x

y

class(x)
class(y)

### Type Conversion
You can convert from one type to another with the following functions:
- `as.numeric()`
- `as.integer()`
- `as.complex()`

x <- 1L #integer
y <- 2 #numeric

# convert from integer to numeric
a <- as.numeric(x)
a
class(a)

b <- as.integer(y)
b
class(b)
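`as.complex()` is listed above but not demonstrated. The following is a minimal sketch of it, together with how `as.integer()` treats fractional values. The variable names `z`, `n` and `m` are illustrative only and are not part of the original notebook.

```r
# as.complex() turns a numeric value into a complex number with zero imaginary part
z <- as.complex(2)
class(z)   # "complex"

# as.integer() drops the fractional part (truncation toward zero); it does not round
n <- as.integer(2.9)
m <- as.integer(-2.9)
n          # 2
m          # -2
```

If rounding is wanted instead of truncation, `round()` can be applied before the conversion.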

max(5,10,15)

min(5,10,15)

sqrt(16)

abs(-5)

abs(-4.7)

### Ceiling and Floor
- The `ceiling()` function rounds a number upwards to its nearest integer, and the `floor()` function rounds a number downwards to its nearest integer

ceiling(1.4)

floor(1.4)

ceiling(2.7)

floor(2.7)

### R Strings
String Literals
- Character values, or strings, are used for storing text. A string is surrounded by either single quotation marks or double quotation marks.
- `"hello"` is the same as `'hello'`
- To assign a string to a variable, write the variable name followed by the `<-` operator and the string: `str <- "Hello World!"`

str <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla in nisi id urna dapibus
ullamcorper et eget libero. Fusce a faucibus purus. Praesent viverra diam eu venenatis dictum.
Cras sit amet vestibulum dolor.
Nulla condimentum scelerisque tellus a varius. Curabitur sit amet luctus nibh.
Mauris bibendum sodales ex eget sodales. Nullam et velit neque.
Sed ut urna vitae libero tempus pharetra nec nec sem.
Quisque lectus ex, bibendum id mauris non, fermentum maximus ex. Sed non libero arcu."

str

cat(str)

nchar(str)

### Matrices
- A matrix is a two-dimensional data set with rows and columns.
- A column is a vertical representation of data, while a row is a horizontal representation of data.
- A matrix can be created with the `matrix()` function. Use the `nrow` and `ncol` parameters to set the number of rows and columns. By default the values are filled column by column; pass `byrow = TRUE` to fill row by row.

matrix1 <- matrix(c(1,2,3,4,5,6),nrow=3,ncol=2)

matrix1

print(matrix1)

matrix(c(1,2,3,4,5,6),nrow=3,ncol=2,byrow = TRUE)

matrix2 <- matrix(c('apple','banana','cherry','orange'),nrow=2,ncol=2)

matrix2

print(matrix2)

matrix3 <- matrix(c("apple",'banana', 'cherry',
                          'orange', 'grape','pineapple','pear','melon','fig'),
                       nrow=3,ncol=3)

matrix3

#### Accessing multiple rows

matrix3[c(1,2),]

#### Accessing multiple columns

matrix3[,c(1,2)]

#### Adding rows and columns
- Use `cbind()` to add columns
- Use `rbind()` to add rows
- **A new column must have as many cells as the matrix has rows, and a new row must have as many cells as the matrix has columns**

matrix3.1 <- cbind(matrix3,c("strawberry","blueberry","raspberry"))

print(matrix3.1)

matrix3.2 = rbind(matrix3.1,c("mango","litchee","guava","cranberry"))

print(matrix3.2)

#### Removing Rows & Columns
- Use negative indices, built with the `c()` function, to remove rows and columns from a matrix: `m[-c(i), -c(j)]` drops row `i` and column `j`

print(matrix3.2[-c(1),-c(1)])

print(matrix3.2[-c(4),-c(4)])

matrix4 <- matrix3.2[-c(4),-c(4)]

print(matrix4)
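Negative indexing has one behaviour worth knowing: when only a single row or column is left, R simplifies the result to a plain vector unless `drop = FALSE` is given. The sketch below uses a small hypothetical matrix `m_demo` that is not part of the original notebook.

```r
# Hypothetical 3 x 3 matrix used only for this illustration
m_demo <- matrix(1:9, nrow = 3, ncol = 3)

# Drop the first row: the result is still a matrix (2 x 3)
no_first_row <- m_demo[-1, ]
dim(no_first_row)      # 2 3

# Drop columns 2 and 3: only one column is left, so R returns a vector
one_col_vec <- m_demo[, -c(2, 3)]
is.matrix(one_col_vec) # FALSE

# drop = FALSE keeps the matrix shape (3 x 1)
one_col_mat <- m_demo[, -c(2, 3), drop = FALSE]
dim(one_col_mat)       # 3 1
```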

-----

### Basic Syntax

#### Mathematics

1+1

1-3

1*5

1/6

x = 1
y = 2
x + y

z = x + y
z

#### Trigonometry

tan(2)

sin(2)

cos(2)

#### Statistics

# Standard Deviation
sd(c(1,2,3,4,5,6))

# Mean
mean(c(1,2,3,4,5,6))

# Variance
var(c(1,2,3,4,5,6))

# Median
median(c(1,2,3,4,5,6))

# Minimum
min(c(1,2,3,4,5,6))

# Maximum
max(c(1,2,3,4,5,6))

plot(c(1,2,3,4,5,6))

plot(c(1,2,3,4,5,6),c(2,3,4,5,6,7))

### Data Types

a <- "abc"

b <- 1.2

a+b

paste("The data class of var a is",class(a))
paste("The data class of var b is",class(b))

a <- TRUE

paste("The data class of var a is",class(a))

#### We can use `is.datatype()` to determine whether a variable is of a certain data type

is.numeric(a)

a <- 123

is.numeric(a)

b <- "56"

is.numeric(b)

is.character(b)

a <- 12

b <- "56"

a+b

b <- as.numeric(b)

a+b

`a <- 12` makes `a` a numeric value, and `b <- "56"` makes `b` a character value.
Adding `a` and `b` at this point raises an error, because R cannot add a number to a character string. After converting with `b <- as.numeric(b)`, both variables are numeric, so `a + b` works.
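The conversion only succeeds when the string actually looks like a number. As a small sketch (the names `num_ok` and `num_bad` are illustrative assumptions), a string such as `"abc"` converts to `NA` with a warning rather than raising an error:

```r
# A string that holds digits converts cleanly
num_ok <- as.numeric("56")
12 + num_ok            # 68

# A string that is not a number becomes NA; R also raises a warning,
# which is suppressed here only to keep the output quiet
num_bad <- suppressWarnings(as.numeric("abc"))
is.na(num_bad)         # TRUE
```

Checking the result with `is.na()` after a conversion is a simple way to catch values that could not be parsed.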

#### Vectors
- A vector is a basic data structure or R object for storing a set of values of the same data type.
- A vector is the most basic and common data structure in R.
- A vector is used when you want to store and modify a set of values
- Vectors can be created using the `c()` function as follows:

a <- c(1,2,3,4,5,6)

print(a)

a <- 1:8

print(a)

#### Lists
- A list is like a vector, but its elements can be of different data types

a <- list("a",'b',1,2)

print(a)

a
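The difference between a list and a vector can be seen by storing mixed values in each. In the sketch below, the list `fruit` and its field names are hypothetical and chosen only for illustration.

```r
# A list keeps each element's own type
fruit <- list(name = "apple", price = 1.5, in_stock = TRUE)
class(fruit$name)         # "character"
class(fruit[["price"]])   # "numeric"
class(fruit$in_stock)     # "logical"

# c() builds a vector, so mixed values are coerced to one common type
v <- c("apple", 1.5, TRUE)
class(v)                  # "character"
```

Elements of a list are accessed with `$name` or `[[ ]]`; single brackets `[ ]` return a shorter list rather than the element itself.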

#### Matrix
- To create a matrix, we can use the following syntax:
<br>`var <- matrix(vector,nrow=m,ncol=n)`

a <- matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,ncol=3)

a

print(a)

a <- matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,ncol=3,dimnames = list(c("x",'y','z'),c('a','b','c')))

a

print(a)

class(a)

attributes(a)

rownames(a)

colnames(a)

##### Creating matrix using `rbind()` and `cbind()`

b <- cbind(c(1,2,3),c(4,5,6))

b

print(b)

c <- rbind(c(1,2,3),c(4,5,6))

c

print(c)

##### Transpose of a matrix

print(a)

print(t(a))

#### DataFrames

a <- data.frame(emp_id=c(1,2,3),names=c('John','James','Mary'),salary=c(111.1,222.2,333.3))

print(a)

a

typeof(a)

class(a)

ncol(a)

nrow(a)

str(a)

### Reading files

#### CSV

data <- read.csv('insurance.csv',header=TRUE,sep=',')

data

## head(data)

#### Excel

#### SPSS

----

## Reading a CSV File

data <- read.csv('monthly_crude_oil_processed.csv',header=TRUE,sep=',')

print(head(data))

head(data)

## Reading an Excel file (xlsx)

library(xlsx)

data_xl <- read.xlsx("importexport202223.xlsx",sheetIndex = 1)

print(head(data_xl))

head(data_xl)

## Reading a SPSS data file `.sav`

library(foreign)

data_sps <- read.spss('Orders.sav',to.data.frame = TRUE)

print(head(data_sps))

head(data_sps)

-----

## Descriptive Statistics

a <- c(1,2,3,4,5,5,5,6,7,8)

summary(a)

mean(a)

median(a)

mode = function(vec){
    y <- table(vec)
    print(names(y)[which(y==max(y))])
}

mode(a)

var(a)

sd(a)

## Generating Random Numbers

set.seed(123)
# The main point of using the seed is to reproduce a particular sequence of 'random' values from normal distribution

### Normal Distribution

b <- rnorm(100,3,0.5)

b

hist(b,breaks=15)

### Binomial Distribution

c <- rbinom(10000,100,0.5)

c

hist(c,breaks=20)

library(dplyr)

set.seed(69)

var1 <- rnorm(100,2,1)
var2 <- rnorm(100,3,1)
var3 <- rnorm(100,3,2)

data <- data.frame(var1,var2,var3)

head(data)

sample(data$var1,5,replace = TRUE)

data(iris)

summary(iris)

str(iris)

sample(iris$Sepal.Length,13,replace = TRUE)

### Stratified Sampling

Selecting 13 random samples from each class

iris_sample <- iris %>% group_by(Species) %>% sample_n(13)

iris_sample

## Descriptive Statistics

a <- c(1,2,3,4,5,5,5,6,7,8)

summary(a)

mean(a)

mean(var1)

mean(var2)

mean(var3)

median(a)

median(var1)

mode = function(vec){
    y <- table(vec)
    print(names(y)[which(y==max(y))])
}

mode(a)

var(a)

sd(a)

range(a)

diff(range(a))

min(a)

max(a)

paste('Range is:',min(a),",",max(a))

max(a)-min(a)

IQR(a)

quantile(a)

qqnorm(data$var1)
qqline(data$var1)

qqnorm(data$var2)
qqline(data$var2)

qqnorm(data$var3)
qqline(data$var3)

export_data <- read.csv('importexport202223.csv')

head(export_data)

qqnorm(export_data$APRIL)
qqline(export_data$APRIL)

If the p-value is low (below the chosen significance level, usually 0.05), we reject the null hypothesis; if the p-value is high, we fail to reject it.

## Shapiro-Wilk Normality test

shapiro.test(data$var1)

shapiro.test(data$var2)

shapiro.test(data$var3)
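The p-value rule stated before this section can be written as an explicit check on the object returned by `shapiro.test()`. The sketch below is illustrative only: the sample, the seed, and the names `sample_data`, `alpha` and `result` are assumptions and not part of the original notebook.

```r
set.seed(1)                     # arbitrary seed, for reproducibility only
sample_data <- rnorm(50)        # hypothetical sample
alpha <- 0.05                   # conventional significance level

# H0 of the Shapiro-Wilk test: the data come from a normal distribution
result <- shapiro.test(sample_data)

if (result$p.value < alpha) {
    print("Reject H0: the data do not look normally distributed")
} else {
    print("Fail to reject H0: no evidence against normality")
}
```

The same pattern works with `data$var1`, `data$var2` and `data$var3` in place of `sample_data`.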

hist(data$var1)

hist(data$var2)

hist(data$var3)

## CDF

- To calculate the cumulative distribution function (CDF), F(x) = P(X <= x), where X is normally distributed, we use the `pnorm()` function:<br>
`pnorm(1.9,3,0.5)` is a direct lookup of the probability P(X <= 1.9), where X is normally distributed with a mean of 3 and a standard deviation of 0.5.<br> If we want P(X > 1.9), we use `1-pnorm(1.9,3,0.5)`
- To calculate the inverse CDF, that is, to look up the p-th quantile of the normal distribution, we use the `qnorm()` function:<br>
`qnorm(0.95,3,0.5)` returns the 95th percentile of the normal distribution with a mean of 3 and a standard deviation of 0.5. The value returned is an x value, not a probability.

pnorm(6.9,4,0.2)

qnorm(0.98,4,0.2)
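Because `qnorm()` is the inverse of `pnorm()`, feeding the output of one into the other recovers the starting value. A minimal sketch, with illustrative variable names that are not part of the original notebook:

```r
# P(X <= 1.9) for X ~ Normal(mean = 3, sd = 0.5)
p <- pnorm(1.9, mean = 3, sd = 0.5)

# qnorm() maps the probability back to the x value
x_back <- qnorm(p, mean = 3, sd = 0.5)
all.equal(x_back, 1.9)        # TRUE

# P(X > 1.9) computed two equivalent ways
upper1 <- 1 - p
upper2 <- pnorm(1.9, mean = 3, sd = 0.5, lower.tail = FALSE)
all.equal(upper1, upper2)     # TRUE
```

The `lower.tail = FALSE` form is usually preferred for very small tail probabilities, since it avoids subtracting from 1.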

## Binomial Distribution

### Probability Mass Function

- A binomial distribution models the number of successes in a fixed number of trials, each of which has two possible outcomes, success or failure. The number of trials is fixed, the trials are independent of each other, and the probability of success is the same for every trial.
- To get the probability mass function, P(X=x), of a binomial distribution, we can use the `dbinom()` function.

dbinom(32,100,0.5)

The above code looks up P(X=32), where X follows a binomial distribution with a size of 100 and a probability of success of 0.5

### Cumulative Distribution Function

- To get the cumulative distribution function P(X<=x) of a binomial distribution we can use `pbinom()` function

pbinom(32,100,0.5)

The above code looks up P(X <= 32), where X follows a binomial distribution with a size of 100 and a probability of success of 0.5

### P<sup>th</sup> quantile

- To get the p-th quantile of the binomial distribution, we can use the `qbinom()` function

qbinom(0.3,100,0.5)

The above code looks up the 0.3 quantile (30<sup>th</sup> percentile) of the binomial distribution with a size of 100 and a probability of success of 0.5. The value returned is the smallest x for which P(X <= x) is at least 0.3.
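The three binomial functions above are tied together: the CDF is the running sum of the PMF, and the quantile function inverts the CDF. The sketch below checks both relationships numerically; the variable names are illustrative assumptions.

```r
# The CDF is the sum of the PMF: P(X <= 32) = sum of P(X = k) for k = 0..32
cdf_direct <- pbinom(32, size = 100, prob = 0.5)
cdf_summed <- sum(dbinom(0:32, size = 100, prob = 0.5))
all.equal(cdf_direct, cdf_summed)     # TRUE

# qbinom() returns the smallest x whose CDF reaches the requested probability
q <- qbinom(0.3, size = 100, prob = 0.5)
pbinom(q, 100, 0.5) >= 0.3            # TRUE
pbinom(q - 1, 100, 0.5) < 0.3         # TRUE
```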

### Generating random variables

- To generate random variables from a binomial distribution, we can use the `rbinom()` function

set.seed(420)

a <- rbinom(1000,100,0.5)

hist(a,breaks=20)

hist(a,breaks=15)

hist(a,breaks=30)

We can use the `rbinom()` or `rnorm()` to generate random variables to simulate a new dataset

## `summary()` and `str()` functions

- The `summary()` and `str()` functions are the fastest way to get an overview of the data. The `summary()` function gives basic descriptive statistics for each variable. The `str()` function gives the structure of the variables.

summary(data)

str(data)

## Correlations

- Correlation is a statistical measure of how closely two variables are linearly related. In predictive analytics, you can use correlation to find which variables are most related to the target variable and use this to reduce the number of variables. Correlation does not imply a causal relationship. Correlation tells you how closely two variables move together, but not the how and why of the relationship. Causation means that a change in one variable causes a change in another variable.

iris_data <- data.frame(iris)

colnames(iris)

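# Number of numeric columns (the last column, Species, is a factor)
col_names = length(colnames(iris))-1

col_names

# Visit each pair of numeric columns exactly once.
# Note that i+1:col_names-i would be parsed as i + (1:col_names) - i.
for(i in 1:(col_names-1))
    {
    for (j in (i+1):col_names)
    {
        print(paste("Correlation between",colnames(iris_data)[[i]],"and",colnames(iris_data)[[j]]))
        print(cor(iris_data[,colnames(iris_data)[[i]]],iris_data[,colnames(iris_data)[[j]]]))
    }
}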
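When `cor()` is given a data frame of numeric columns, it returns the full correlation matrix in a single call, which is a compact alternative to looping over pairs. The sketch below assumes the built-in `iris` data set; the name `cor_matrix` is illustrative.

```r
data(iris)

# Columns 1 to 4 are numeric; column 5 (Species) is a factor and is left out
cor_matrix <- cor(iris[, 1:4])

# Rounded for readability; the diagonal is 1 because every variable
# is perfectly correlated with itself
round(cor_matrix, 2)
```

The matrix is symmetric, so each pair appears twice, once above and once below the diagonal.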

cor(data$var1,data$var2)

The correlation ranges from -1.0 to 1.0. When the correlation is 0, there is no linear relationship. When the correlation is more than 0, it is a positive relationship: when one variable's value increases, the other variable's value also tends to increase. When the correlation is less than 0, it is a negative relationship: when one variable's value increases, the other variable's value tends to decrease. 1 is a perfect positive correlation and -1 is a perfect negative correlation. Hence, the closer the value is to 1 or to -1, the stronger the linear relationship.
<br>**-0.0671436526092431** means that the correlation between var1 and var2 is negative; because it is close to zero, the linear relationship is very weak

cor(data$var1,data$var3)

**0.0713532054580629** means that the correlation between var1 and var3 is positive; because it is close to zero, the linear relationship is very weak

cor(data$var2,data$var3)

**-0.0796639546553364** means that the correlation between var2 and var3 is negative; because it is close to zero, the linear relationship is very weak

## Covariance

Covariance is a measure of how two variables vary together. When greater values of one variable tend to go with greater values of the other variable, the covariance is positive. When greater values of one variable tend to go with lesser values of the other variable, the covariance is negative. Covariance shows the direction of the linear relationship between the two variables, but its magnitude is difficult to interpret because it depends on the units of the variables.

# Number of numeric columns (the last column, Species, is a factor)
col_names = length(colnames(iris))-1
# Visit each pair of numeric columns exactly once
for(i in 1:(col_names-1))
    {
    for (j in (i+1):col_names)
    {
        print(paste("Covariance between",colnames(iris_data)[[i]],"and",colnames(iris_data)[[j]]))
        print(cov(iris_data[,colnames(iris_data)[[i]]],iris_data[,colnames(iris_data)[[j]]]))
    }
}

cov(data$var1,data$var2)

Correlation has a range of -1 to 1. Covariance does not have a fixed range. Correlation is good for measuring how strong the relationship between two variables is. When two variables have a positive covariance, when one variable increases, the other variable tends to increase. When two variables have a negative covariance, when one variable increases, the other variable tends to decrease. When two variables are independent of each other, the covariance is zero.<br>
-0.0585462183730391 means the covariance between var1 and var2 is negative, and it is very close to zero, so the relationship between the two variables is very weak. Correlation and covariance are usually treated as part of descriptive statistics

cov(data$var1,data$var3)

0.136573405993444 means the covariance between var1 and var3 is positive, and it is close to zero, so the relationship between the two variables is very weak. Correlation and covariance are usually treated as part of descriptive statistics

cov(data$var2,data$var3)

-0.156280168159263 means the covariance between var2 and var3 is negative, and it is close to zero, so the relationship between the two variables is very weak. Correlation and covariance are usually treated as part of descriptive statistics
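Correlation is covariance rescaled by the two standard deviations, which is why it is bounded while covariance is not. The sketch below checks this on two hypothetical variables, `u` and `v`, generated only for illustration with an arbitrary seed.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
u <- rnorm(100, 2, 1)        # hypothetical variables
v <- rnorm(100, 3, 2)

# Correlation = covariance divided by the product of the standard deviations
r_from_cov <- cov(u, v) / (sd(u) * sd(v))
all.equal(r_from_cov, cor(u, v))            # TRUE

# Changing the units of u scales the covariance but leaves the correlation unchanged
all.equal(cov(100 * u, v), 100 * cov(u, v)) # TRUE
all.equal(cor(100 * u, v), cor(u, v))       # TRUE
```

This is the reason the magnitude of a covariance is hard to interpret on its own: it changes with the measurement units.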

-----

## Histogram

|Life of tyre (in '000kms)|15-20|20-25|25-30|30-35|35-40|40-45|45-50|
|:-----------------------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|No.of tyres|5|8|13|20|14|6|4|

Note: In the `seq()` call, the second number can be any number between 47.5 and 50, because the sequence stops at the last class midpoint, 47.5.
<br>The width of each class is 5, i.e. w = 5

x <- seq(17.5,48.5,5)

w <- 5

f <- c(5,8,13,20,14,6,4)

# Lower and upper class boundaries: midpoint minus / plus half the class width
lb <- x-w/2
ub <- x+w/2

brks <- c(lb[1],ub)

y <- rep(x,f)

y

hist(y)

hist(y,xlab="Life of tyre",ylab="Number of tyres",main="Histogram")

#### We can superimpose a frequency curve on the histogram by adding a `lines()` statement

hist(y,xlab="Life of tyre",ylab="Number of tyres",main="Histogram")
lines(x,f)
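By default `hist()` chooses its own break points. To make the bars follow the class intervals of the table exactly, the class boundaries can be passed through the `breaks` argument. The sketch below uses illustrative names (`mid`, `freq_demo`, `brks_demo`, `y_demo`) that are assumptions, not part of the original notebook.

```r
width     <- 5
mid       <- seq(17.5, 47.5, width)        # class midpoints
freq_demo <- c(5, 8, 13, 20, 14, 6, 4)     # class frequencies from the table
brks_demo <- seq(15, 50, width)            # class boundaries 15, 20, ..., 50
y_demo    <- rep(mid, freq_demo)           # one midpoint per tyre

h <- hist(y_demo, breaks = brks_demo, xlab = "Life of tyre",
          ylab = "Number of tyres", main = "Histogram")

# The bar heights now match the table
h$counts                                   # 5 8 13 20 14 6 4
```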

## Frequency Curve

x <- seq(17.5,47.5,5)

f <- c(5,8,13,20,14,6,4)

plot(x,f,"l",xlab="Life of tyre",ylab="Number of tyres",main='Frequency curve')

plot(x,f,xlab="Life of tyre",ylab="Number of tyres",main='Frequency curve')

## Frequency Polygon
- For a frequency polygon, add one extra class midpoint before the first and after the last value in x, and add a frequency of 0 at the start and end of f

x <- seq(12.5,52.5,5)

f <- c(0,5,8,13,20,14,6,4,0)

plot(x,f,"l",xlab="Life of tyre",ylab="Number of tyres",main="Frequency Polygon")

plot(x,f,"l",xlab="Life of tyre",ylab="Number of tyres",main="Frequency Polygon")
points(x,f,pch=16)

plot(x,f,"l",xlab="Life of tyre",ylab="Number of tyres",main="Frequency Polygon")
points(x,f,pch=25)

plot(x,f,"l",xlab="Life of tyre",ylab="Number of tyres",main="Frequency Polygon")
points(x,f,pch=69)

plot(x,f,"l",xlab="Life of tyre",ylab="Number of tyres",main="Frequency Polygon")
points(x,f,pch=34)

## Less than type ogive

x <- seq(17.5,48.5,5)

f <- c(0,5,8,13,20,14,6,4)

lb <- x-w/2
ub <- x+w/2

k <- length(x)

lb1 <- c(lb,50)
ub1 <- c(15,ub)

lcf <- cumsum(f)

plot(ub1,lcf,"l",xlim=c(15,50),xlab='Class limits',ylab='cumfreq',
     main='Less than ogive',lwd=2)
points(ub1,lcf,pch=16)
# lwd sets the width of the line

## More than type ogive

x <- seq(17.5,48.5,5)

f <- c(5,8,13,20,14,6,4,0) # trailing 0: no tyre lasts more than 50, so mcf ends at 0

lb <- x-w/2
ub <- x+w/2

k <- length(x)+1

lb1 <- c(lb,50)
ub1 <- c(15,ub)

mcf <- 1:k

for(i in 1:k)
    {
    mcf[i] = sum(f[k:i])
}

plot(lb1,mcf,"l",xlim=c(15,50),xlab='Class limits',ylab='cumfreq',
     main='More than ogive',lwd=2)
points(lb1,mcf,pch=16)
# lwd sets the width of the line
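The "more than" cumulative frequencies can also be computed without a loop by reversing the frequencies, taking a cumulative sum, and reversing back. This is a sketch with illustrative names (`tyre_freq`, `lower_bounds`, `mcf_vec`), not the notebook's own method.

```r
tyre_freq    <- c(5, 8, 13, 20, 14, 6, 4)   # frequencies for 15-20, ..., 45-50
lower_bounds <- seq(15, 50, 5)              # 15, 20, ..., 50

# Number of tyres whose life exceeds each boundary; nothing exceeds 50
mcf_vec <- c(rev(cumsum(rev(tyre_freq))), 0)
mcf_vec                                     # 70 65 57 44 24 10 4 0
```

The first value equals the total number of tyres and the values never increase, which is a quick sanity check for any "more than" ogive.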

## Boxplot

x <- c(0,12,14,17,22,7,11,29,21,20,30,15,17,12,16,9,8,12,16,9,8,12,16,9,8,12,23,26,14,19)

boxplot(x)

boxplot(x,ylab='Marks')

boxplot(x,ylab='Marks')
f = fivenum(x)
text(rep(1.3,5),f,labels=c('Minimum','1st Quartile','Median','3rd Quartile','Maximum'))

------

x <- c(0,12,14,17,22,7,11,29,21,20,30,15,17,12,16,9,8,12,16,9,8,12,16,9,8,12,23,26,14,19,100)

boxplot(x)

boxplot(x,ylab='Marks')

boxplot(x,ylab='Marks')
f = fivenum(x)
text(rep(1.3,5),f,labels=c('Minimum','1st Quartile','Median','3rd Quartile','Maximum'))

## Pie chart

Represent the data by pie diagram

|Item|% expenses|
|:--:|:--------:|
|Food|25|
|Rent|18|
|Cloths|22|
|Travel|20|
|Misc|15|

i <- c("Food","Rent","Cloths","Travel","Misc")
e <- c(25,18,22,20,15)

pie(e,radius=1)

pie(e,main='Percentage Expenses',col=3:7,labels=i) #Labels added, for specific colours use col

pie(e,main='Percentage Expenses',col=1:7,labels=i) #Labels added, for specific colours use col

#### Homework
Draw a histogram, frequency curve, frequency polygon, less than ogive curve, more than ogive curve of the following data.

View(airquality) # opens the data viewer (interactive sessions only)

str(airquality)

sum(is.na(airquality))

temperature <- airquality$Temp

hist(temperature, main="Maximum daily temperature at La Guardia Airport",xlab='Temperature in degrees Fahrenheit',
    xlim=c(50,100),col='grey',freq=FALSE)
# Creating histogram considering density instead of frequency, in this case the total area of the histogram is 1

hist(airquality$Solar.R, main="Solar radiation in New York",xlab='Solar radiation in Langleys',
    xlim=c(0,350),col='grey',freq=FALSE)
# Creating histogram considering density instead of frequency, in this case the total area of the histogram is 1

hist(temperature, main="Maximum daily temperature at La Guardia Airport with 4 breaks",xlab='Temperature in degrees Fahrenheit',
    breaks=4,col='grey',freq=FALSE)
# Creating histogram considering density instead of frequency, in this case the total area of the histogram is 1

With the `breaks` argument we can specify the number of cells. R treats a single number as a suggestion only and picks nearby "pretty" break points.

hist(temperature, main="Maximum daily temperature at La Guardia Airport with 20 breaks",xlab='Temperature in degrees Fahrenheit',
    breaks=20,col='grey',freq=FALSE)
# Creating histogram considering density instead of frequency, in this case the total area of the histogram is 1

------

## Q1
Draw a histogram, frequency curve, frequency polygon, less than type ogive, more than type ogive curve of the following data.

||||||||||
|:----------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|Sales (0'00)|20-25|25-30|30-35|35-40|40-45|45-50|50-55|55-60|
|No.of days|5|9|13|28|20|12|10|3|

sales <- seq(22.5,57.5,5)

w <- 5

freq <- c(5,9,13,28,20,12,10,3)

### Histogram

y <- rep(sales,freq)

hist(y,xlab="Sales",ylab="Number of days",main="Histogram")

### Frequency Curve

plot(sales,freq,'l',xlab='Sales',ylab='Number of days',
     main='Frequency Curve')

### Frequency Polygon

sales <- seq(17.5,62.5,5)
freq <- c(0,5,9,13,28,20,12,10,3,0)

plot(sales,freq,'l',xlab='Sales',ylab='Number of days',
     main='Frequency Polygon')
points(sales,freq,pch=16)

### Less than ogive

sales <- seq(22.5,57.5,5)
freq <- c(0,5,9,13,28,20,12,10,3)

lb <- sales-w/2
ub <- sales+w/2

k <-length(sales)

lb1 <- c(lb,60)
ub1 <- c(20,ub)

lcf <- cumsum(freq)

plot(ub1,lcf,'l',xlim=c(20,60),xlab='Class Limits',
     ylab='Cumulative Frequency',main='Less than ogive',lwd=2)
points(ub1,lcf,pch=16)

### More than ogive

sales <- seq(22.5,57.5,5)
freq <- c(5,9,13,28,20,12,10,3,0) # trailing 0: no day has sales above 60, so mcf ends at 0

lb <- sales-w/2
ub <- sales+w/2

k <-length(sales)+1

lb1 <- c(lb,60)
ub1 <- c(20,ub)

mcf <- 1:k

for (i in 1:k)
    {
    mcf[i] = sum(freq[k:i])
}

plot(lb1,mcf,'l',xlim=c(20,60),xlab='Class Limits',
     ylab='Cumulative Frequency',main='More than ogive',lwd=2)
points(lb1,mcf,pch=16)

## Q2
The following figures relate to the cost of construction of a house in a city.

||||||||
|:--:|:----:|:---:|:----:|:----:|:----:|:-----------:|
|Item|Cement|Steel|Bricks|Timber|Labour|Miscellaneous|
|% Expenditure|20|18|10|15|25|12|

Present the data with the help of a suitable diagram

item <- c('Cement','Steel','Bricks','Timber','Labour','Miscellaneous')
expenditure_per <- c(20,18,10,15,25,12)

pie(expenditure_per,col = 3:8,labels = item,
    main='Cost Distribution of house construction')

## Q3
The following data give the daily expenses of 40 school children from a certain locality
```
21,50,35,39,48,46,36,54,42,30,29,42,32,40,34,
31,35,37,52,44,39,42,32,40,34,31,100
```

Draw boxplot and write conclusion

expenses <- c(21,50,35,39,48,46,36,54,42,30,29,42,32,40,34,
              31,35,37,52,44,39,42,32,40,34,31,100)

boxplot(expenses,ylab='Expenses')
summ <- fivenum(expenses)
text(rep(1.3,5),summ,
labels = c('Minimum','1st Quartile','Median','3rd Quartile','Maximum'))

- There is one outlier in the given data (100), as this data point lies above the upper whisker of the boxplot.
- Median expense is 39
- Average expense is ~40.56
- Minimum expense is 21 and maximum expense is 100
- 1st quartile value is 33, 3rd quartile value is 43
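The outlier can be confirmed numerically. `boxplot()` flags points that lie more than 1.5 times the hinge spread beyond the quartiles (hinges) returned by `fivenum()`. The sketch below recomputes those fences by hand; the names `q`, `spread`, `upper_fence` and `lower_fence` are illustrative assumptions.

```r
expenses <- c(21,50,35,39,48,46,36,54,42,30,29,42,32,40,34,
              31,35,37,52,44,39,42,32,40,34,31,100)

q <- fivenum(expenses)               # min, lower hinge, median, upper hinge, max
spread <- q[4] - q[2]                # 43 - 33 = 10
upper_fence <- q[4] + 1.5 * spread   # 58
lower_fence <- q[2] - 1.5 * spread   # 18

# Points outside the fences are drawn as outliers
expenses[expenses > upper_fence | expenses < lower_fence]   # 100

# boxplot.stats() reports the same points without drawing a plot
boxplot.stats(expenses)$out                                 # 100
```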

----

## Hypothesis Testing and P-Value

- A hypothesis test is stated as a null hypothesis, H<sub>0</sub>, and an alternate hypothesis, H<sub>1</sub>. You can write the null hypothesis and alternate hypothesis as follows:<br>
H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub><br>
H<sub>1</sub>: μ<sub>1</sub> != μ<sub>2</sub><br>
where μ<sub>1</sub> is the mean of one data set and μ<sub>2</sub> is the mean of another data set. We use statistical tests to get the p-value. We use a t-test for continuous variables or data and a chi-square test for categorical variables or data. For more complex testing, such as comparing more than two groups, you use ANOVA. If the data is not normally distributed, use non-parametric tests.<br>
A p-value helps to determine the significance of statistical test results. A small p-value (less than alpha, which is usually 0.05) indicates that the observed data is sufficiently inconsistent with the null hypothesis, so the null hypothesis may be rejected in favour of the alternate hypothesis at the 5% significance level. A larger p-value means that we fail to reject the null hypothesis.

### T-Test

A t-test is one of the more important tests in statistics. A t-test is used to determine whether the means of two samples are equal to each other.<br>
H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub><br>
H<sub>1</sub>: μ<sub>1</sub> != μ<sub>2</sub>

#### Types of t-test

##### One-Sample Test
- To use a one-sample t-test in R, you can use the `t.test()` function

set.seed(123)

var1 <- rnorm(100,mean=2,sd=1)
var2 <- rnorm(100,mean=3,sd=1)
var3 <- rnorm(100,mean=3,sd=2)

data <- data.frame(var1,var2,var3)

t.test(data$var1,mu=0.6)

H<sub>0</sub>: μ<sub>1</sub> = m<br>
H<sub>1</sub>: μ<sub>1</sub> != m<br>
m is 0.6. The p-value is 2.2e<sup>-16</sup>, which is less than the alpha value of 0.05. Therefore, the null hypothesis can be rejected in favour of the alternate hypothesis, μ != 0.6, at the 5% significance level.
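The statistic that `t.test()` reports for a one-sample test is the difference between the sample mean and `mu`, divided by the standard error of the mean. The sketch below recomputes it by hand and reads the p-value from the returned object. The names `sample1`, `res` and `t_manual` are illustrative assumptions; the sample is generated the same way as `var1` above.

```r
set.seed(123)
sample1 <- rnorm(100, mean = 2, sd = 1)

res <- t.test(sample1, mu = 0.6)

# t = (sample mean - hypothesised mean) / (sd / sqrt(n))
t_manual <- (mean(sample1) - 0.6) / (sd(sample1) / sqrt(length(sample1)))
all.equal(unname(res$statistic), t_manual)   # TRUE

# Degrees of freedom are n - 1
res$parameter                                # 99

# The p-value can be compared with alpha directly
res$p.value < 0.05                           # TRUE
```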

###### Q2

x <- c(5,3,4,3,2,6,3,2,3,6,7,5,3)

x

t.test(x)

m is 0 (the default value of `mu`). The p-value is 1.347e<sup>-6</sup>, so we can reject the null hypothesis in favour of the alternate hypothesis, μ != 0, at the 5% significance level.

###### Q3
- We have collected a random sample of 31 energy bars from a number of different stores to represent the population of energy bars available to the general consumer. The labels on the bars claim that each bar contains 20 grams of protein.

x <- c(20.70,20.75,22.14,19.72,25.06,27.46,22.91,19.56,18.28,22.44,22.15,25.34,21.10,16.26,19.08,19.85,20.33,18.04,
       17.46,19.88,21.29,21.54,24.12,20.53,21.39,24.75,21.08,19.95,22.12,22.33,25.79)

t.test(x,mu=20)

###### Q4
We have the potato yield from 12 different farms. We know that the standard potato yield for the given variety is mu=20. Test whether the potato yield from these farms is significantly better than the standard yield.<br>
H<sub>0</sub>: mu = 20<br>
H<sub>1</sub>: mu > 20

x <- c(21.5,24.5,18.5,17.2,14.5,23.2,22.1,20.5,19.4,18.1,24.1,18.5)

t.test(x,mu=20,alternative = 'greater')

##### Two-Sample T-Test
The two-sample unpaired t-test compares the means of two independent samples. To run a two-sample unpaired t-test that assumes equal variances in R, call `t.test()` with `var.equal = TRUE` and `paired = FALSE`.<br>
To test:<br>
H<sub>0</sub>: muA - muB = 0<br>
H<sub>1</sub>: muA - muB != 0

set.seed(123)

var1 <- rnorm(100,mean=2,sd=1)
var2 <- rnorm(100,mean=3,sd=1)
var3 <- rnorm(100,mean=3,sd=2)

data <- data.frame(var1,var2,var3)

t.test(data$var1,data$var2,var.equal = TRUE,paired = FALSE)

- The p-value is 7.843e<sup>-0</sup> so it is less than 0.05, so we can reject the null hypothesis

###### Q1
A group of men and women did workouts at a gym three times a week for a year. Then, their trainer measured their body fat. The table below shows the data

|Group|Body Fat Percentages|
|:---:|:------------------:|
|Men|13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0|
|Women|22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0|

Check whether the underlying populations of men and women at the gym have the same mean body fat.

men <- c(13.3,6.0,20.0,8.0,14.0,19.0,18.0,25.0,16.0,24.0,15.0,1.0,15.0)
women <- c(22.0,16.0,21.7,21.0,30.0,26.0,12.0,23.2,28.0,23.0)

t.test(men,women,var.equal=TRUE,paired=FALSE)
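`var.equal = TRUE` selects the pooled-variance test. When it is left out, `t.test()` defaults to Welch's test, which does not assume equal variances. The sketch below runs both on the same data so the two results can be compared; the names `pooled` and `welch` are illustrative assumptions.

```r
men <- c(13.3,6.0,20.0,8.0,14.0,19.0,18.0,25.0,16.0,24.0,15.0,1.0,15.0)
women <- c(22.0,16.0,21.7,21.0,30.0,26.0,12.0,23.2,28.0,23.0)

# Pooled-variance test: degrees of freedom are n1 + n2 - 2
pooled <- t.test(men, women, var.equal = TRUE)
pooled$parameter     # 21

# Default (Welch) test: degrees of freedom are estimated from the data
welch <- t.test(men, women)
welch$method         # "Welch Two Sample t-test"

# Compare the two p-values side by side
c(pooled = pooled$p.value, welch = welch$p.value)
```

When the two sample variances look clearly different, the Welch version is generally the safer choice.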

----

x <- rnorm(1000,mean=1,sd=1)

y <- rnorm(1000,mean=2,sd=2)

data <- data.frame(x,y)

mod <- lm(y ~ x,data=data) # refer to columns by name; data= tells lm() where to find them

mod

summary(mod)

- The output shows that the fitted linear equation is y = 0.01128x + 2.01385
- The p-values, <2e<sup>-16</sup> for the intercept and 0.868 for x, tell you the significance of each coefficient. When a p-value is less than 0.05, the coefficient is significant.
<br>**Hypothesis**
- H<sub>0</sub>: Coefficient associated with the variable is equal to zero
- H<sub>1</sub>: Coefficient is not equal to zero (there is a relationship)
<br>The intercept has a p-value of <2e<sup>-16</sup>, which is smaller than 0.05, so the intercept is significantly different from zero. The significance is indicated by the number of `*`. The x has a p-value of 0.868, which is more than 0.05, so x has no significant relationship with the y-variable, and we fail to reject the null hypothesis at the 5% significance level. R-squared is the proportion of the variation in the dependent variable that is explained by the model.
<br>Hence the higher the R-squared and the adjusted R-squared, the better the linear model fits. The lower the standard error, the better the model
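The numbers discussed above can be pulled out of the fitted model programmatically instead of being read off the printed summary. The sketch below is illustrative: the seed is arbitrary (so the values will differ from those quoted above), and the names `sim`, `fit` and `s` are assumptions.

```r
set.seed(123)                         # arbitrary seed, for reproducibility only
sim <- data.frame(x = rnorm(1000, mean = 1, sd = 1),
                  y = rnorm(1000, mean = 2, sd = 2))

fit <- lm(y ~ x, data = sim)
s <- summary(fit)

coef(fit)                             # intercept and slope
s$coefficients[, "Pr(>|t|)"]          # p-value of each coefficient
s$r.squared                           # R-squared
s$adj.r.squared                       # adjusted R-squared
s$sigma                               # residual standard error
```

Because `x` and `y` are generated independently here, the slope is expected to be close to zero and R-squared to be small.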

## Q1. The data regarding the production of wheat in tons (X) and the price of the kilo of flour (Y) in the decade of the 80's in Spain were:

||||||||||||
|:--------------:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Wheat Production|30|28|32|25|25|25|22|24|35|40|
|Flour Price|25|30|27|40|42|40|50|45|30|25|

a. Fit the regression line using the method of least squares
<br>b. Compute a 95% confidence interval for the slope of the regression line
<br>c. Test the hypothesis that the price of flour depends linearly on the wheat production

wheat_prod = c(30,28,32,25,25,25,22,24,35,40)

flour_price = c(25,30,27,40,42,40,50,45,30,25)

data = data.frame(wheat_prod,flour_price)

mod_lm = lm('flour_price ~ wheat_prod',data=data)

mod_lm

summary(mod_lm)

- The output shows that the fitted linear equation is y = -1.3537x + 74.1151
- The p-values of 2.85e<sup>-05</sup> (intercept) and 0.00198 (wheat production) tell the significance of each coefficient; a coefficient is significant when its p-value is less than 0.05
<br>**Hypothesis** for intercept
- H<sub>0</sub>: The intercept is equal to zero.
- H<sub>1</sub>: The intercept is not equal to zero.
- Conclusion: As the p-value for the intercept is 2.85e<sup>-05</sup>, we can reject the null hypothesis, i.e. the intercept is significantly different from zero at the 5% significance level
<br>**Hypothesis** for x-variable wheat production
- H<sub>0</sub>: There is no significant relationship between wheat production and the y-variable.
- H<sub>1</sub>: There is a significant relationship between wheat production and the y-variable.
- Conclusion: As the p-value for wheat production is 0.00198, we can reject the null hypothesis, i.e. there is a significant relationship between wheat production and flour price (y-variable) at the 5% significance level

confint(mod_lm,level=0.95)
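The interval returned by `confint()` for the slope is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds it by hand from the coefficient table; the names `fit_wheat`, `est`, `se`, `t_crit` and `manual_ci` are illustrative assumptions.

```r
wheat_prod  <- c(30,28,32,25,25,25,22,24,35,40)
flour_price <- c(25,30,27,40,42,40,50,45,30,25)

fit_wheat <- lm(flour_price ~ wheat_prod)
coefs <- coef(summary(fit_wheat))

est <- coefs["wheat_prod", "Estimate"]
se  <- coefs["wheat_prod", "Std. Error"]

# Two-sided 95% interval: 0.975 quantile of t with n - 2 = 8 degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit_wheat))
manual_ci <- c(est - t_crit * se, est + t_crit * se)

# Matches the slope row of confint()
all.equal(manual_ci, unname(confint(fit_wheat, "wheat_prod", level = 0.95)[1, ]))  # TRUE
```

If the interval does not contain zero, the slope is significant at the 5% level, which agrees with the p-value test on the same coefficient.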

## Q2. Fit the regression line using the method of least squares

|Verbal IQ|Brain size|
|:-------:|:--------:|
|132|816.932|
|132|951.545|
|90|928.799|

verbal_iq = c(132,132,90,136,90,129,120,100,71,132,112,129,86,90,83,126,126,90,129,86)
brain_size = c(816.932,951.545,928.799,991.305,854.258,833.868,856.472,878.897,865.363,852.244,
               808.02,790.619,831.772,798.612,793.549,866.662,857.782,834.344,948.066,893.983)

data <- data.frame(verbal_iq,brain_size)

qqnorm(data$brain_size)
qqline(data$brain_size)

hist(data$brain_size)

shapiro.test(data$brain_size)

mod <- lm('verbal_iq ~ brain_size',data=data)

mod

summary(mod)

- The Adj R-squared value is ~1.3%, therefore we can say that the model doesn't fit the data well.
- The p-value for brain_size is 0.278 and for the intercept is 0.755, therefore we can say that both intercept and brain_size are not significant.
- The p-value for the model is 0.278, therefore we can say that the model is not significant.

----

## Multiple Linear Regression

- Multiple linear regression is built from simple linear regression
- It is used when we have more than one independent variable
- The equation of multiple linear regression is y = b<sub>0</sub> + b<sub>1</sub>x<sub>1</sub> + b<sub>2</sub>x<sub>2</sub>+.....+b<sub>k</sub>x<sub>k</sub>

set.seed(123)

x <- rnorm(100,mean=1,sd=1)
x2 <- rnorm(100,mean=2,sd=5)

y <- rnorm(100,mean=2,sd=2)

data <- data.frame(x,x2,y)

mod <- lm(y ~ x + x2, data = data)

mod

summary(mod)

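- y = b<sub>0</sub> - 0.266343x + 0.009525x2, where b<sub>0</sub> is the intercept estimate reported by `summary(mod)`
- The p-values for the intercept, x and x2 are 7.97e<sup>-13</sup>, 0.207 and 0.810, and the p-value for the overall model is 0.4295. Only the intercept is significant, because its p-value of 7.97e<sup>-13</sup> is smaller than 0.05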

## Q1
Construct a multiple linear regression model that explores the relationship between blood pressure, weight, height and age.
1. How well does the regression model fit the data?
2. Check the significance of the beta coefficients.
3. Check overall significance of regression model.

blood_pressure = c(105,106,108,110,113,115,118,119,120,122)
weight = c(75,80,89,90,93,95,96,99,101,102)
height = c(172,175,170,174,178,179,180,183,185,188)
age = c(19,18,20,20,21,22,24,25,29,30)

blood_pressure_data <- data.frame(blood_pressure, weight, height, age)

qqnorm(blood_pressure_data$weight)
qqline(blood_pressure_data$weight)

qqnorm(blood_pressure_data$height)
qqline(blood_pressure_data$height)

qqnorm(blood_pressure_data$age)
qqline(blood_pressure_data$age)

shapiro.test(blood_pressure_data$weight)

shapiro.test(blood_pressure_data$height)

shapiro.test(blood_pressure_data$age)

mod_lm = lm(blood_pressure ~ weight + height + age, data = blood_pressure_data)

mod_lm

summary(mod_lm)

- Adj. R-squared score is ~96%, therefore we can say that the regression model fits the data well.
- The p-value for weight is `0.00564`, height is `0.05161`, age is `0.51022` and intercept is `0.96206`. Only weight is significant, as its p-value is less than 0.05; height is just above the threshold.
- The p-value for the model is 2.817e<sup>-05</sup>, therefore we can say that the model is significant.

----

# Non parametric Test

- A non-parametric test is a test that does not require the variable and sample to be normally distributed. Most of the time we use parametric tests such as the t-test and ANOVA because they have more statistical power when their assumptions hold.
- You use non-parametric tests when the data are not normally distributed, especially when the sample is too small to rely on the central limit theorem.
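
One common way to apply this rule of thumb is to check normality first and then pick the test. The helper below is a hypothetical sketch (the function name `choose_location_test` and the 0.05 cut-off are assumptions, not part of the notebook) and should be read as a heuristic rather than a strict procedure.

```r
# Pick a one-sample location test based on a Shapiro-Wilk normality check
choose_location_test <- function(x, mu = 0, alpha = 0.05) {
  normal_enough <- shapiro.test(x)$p.value >= alpha
  if (normal_enough) {
    t.test(x, mu = mu)        # parametric: one-sample t-test
  } else {
    wilcox.test(x, mu = mu)   # non-parametric: Wilcoxon signed rank test
  }
}

set.seed(42)
skewed <- rexp(30, rate = 1)  # right-skewed sample for illustration
res <- choose_location_test(skewed, mu = 1)
print(res$method)
```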

## Wilcoxon Signed Rank Test

- The Wilcoxon signed rank test is a non-parametric alternative to the one-sample t-test.
- For each x<sub>i</sub>, for i = 1, 2, ..., n, the signed difference is d<sub>i</sub> = x<sub>i</sub> - μ<sub>0</sub>, where μ<sub>0</sub> is the hypothesized median.
- The null hypothesis is that the population median has the specified value of μ<sub>0</sub>.
    - Null Hypothesis: H<sub>0</sub> : μ = μ<sub>0</sub>
    - Alternate Hypothesis: H<sub>1</sub> : μ ≠ μ<sub>0</sub>
    
.....
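
The signed differences defined above lead to the test statistic V, the sum of the ranks of |d<sub>i</sub>| over the positive differences. The sketch below computes it by hand on simulated data and compares it with `wilcox.test()`; the sample and the names `d`, `r` and `V` are assumptions made for illustration.

```r
set.seed(1234)
x   <- rnorm(10, mean = 30, sd = 2)
mu0 <- 25

d <- x - mu0        # signed differences d_i = x_i - mu0
d <- d[d != 0]      # zero differences are dropped
r <- rank(abs(d))   # rank the absolute differences
V <- sum(r[d > 0])  # sum of ranks belonging to positive differences

res <- wilcox.test(x, mu = mu0)
print(c(by_hand = V, wilcox_test = unname(res$statistic)))
```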

To try the Wilcoxon signed rank test in R, you can first generate data with the `random` package, which fetches true random numbers from random.org. The values are uniformly distributed integers, so the variables are not normally distributed.

```
install.packages('random')
```

library(random)

var1 <- randomNumbers(n=100,min=1,max=1000,col=1)
var2 <- randomNumbers(n=100,min=1,max=1000,col=1)
var3 <- randomNumbers(n=100,min=1,max=1000,col=1)

`n` is the number of random numbers, `min` is the minimum value, `max` is the maximum value and `col` is the number of columns the numbers are arranged in.
This method generates true random numbers in R, so your data will differ from the values discussed here. You can then create the data frame using

data <- data.frame(var1[,1],var2[,1],var3[,1])

print(head(data))
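
`randomNumbers()` needs a network connection to random.org. If that is not available, a pseudo-random stand-in with the same shape can be built with base R. This is a substitute technique, not what the notebook ran; the seed is arbitrary.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# One-column matrices of 100 integers between 1 and 1000,
# mimicking the shape returned by randomNumbers(..., col = 1)
var1 <- matrix(sample(1:1000, 100, replace = TRUE), ncol = 1)
var2 <- matrix(sample(1:1000, 100, replace = TRUE), ncol = 1)
var3 <- matrix(sample(1:1000, 100, replace = TRUE), ncol = 1)

data <- data.frame(var1[, 1], var2[, 1], var3[, 1])
print(head(data))
```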

To run the Wilcoxon signed rank test, you can use the `wilcox.test()` function

wilcox.test(data[,1],mu=0,alternative='two.sided')

The p-value is < 2.2e<sup>-16</sup>, which is less than 0.05. Hence, you reject the null hypothesis: the median of the first variable differs significantly from 0 at the 5% significance level (95% confidence level).

## R-program to illustrate one-sample Wilcoxon signed rank test

set.seed(1234)

my_data = data.frame(name=paste0(rep('R_',10),1:10),weight=round(rnorm(10,30,2),1))

print(head(my_data))

res = wilcox.test(my_data$weight,mu=25)

res

As the p-value is 0.005793, which is less than 0.05, we can reject the null hypothesis. The median weight differs significantly from 25 at the 5% significance level (95% confidence level).

res1 = wilcox.test(my_data$weight,mu=25,alternative = 'less')

res1

As the p-value is 0.9979, which is more than 0.05, we cannot reject the null hypothesis. There is no evidence that the median weight is less than 25.

res2 = wilcox.test(my_data$weight,mu=25,alternative = 'greater')

res2

As the p-value is 0.002897, which is less than 0.05, we can reject the null hypothesis. The median weight is greater than 25.

## Wilcoxon-Mann-Whitney Test

var1 <- randomNumbers(100,1,1000,1)
var2 <- randomNumbers(100,1,1000,1)
var3 <- randomNumbers(100,1,1000,1)

data <- data.frame(var1[,1],var2[,1],var3[,1])

wilcox.test(data[,1],data[,2],correct=FALSE)

The p-value is 0.4703, which is more than 0.05. Hence we cannot reject the null hypothesis: there is no significant difference between the median of the first variable and the median of the second variable at the 5% significance level. Your p-value will differ because the data are generated randomly.
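
The statistic W reported by the two-sample test is the rank sum of the first sample in the pooled data, minus its smallest possible value. The sketch below reproduces it by hand; it uses `runif()` instead of random.org so that it runs offline and has no ties, and all variable names are illustrative.

```r
set.seed(7)
x <- runif(15, min = 1, max = 1000)
y <- runif(12, min = 1, max = 1000)

# Rank the pooled observations, then sum the ranks that belong to x
r  <- rank(c(x, y))
n1 <- length(x)
W  <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

res <- wilcox.test(x, y, correct = FALSE)
print(c(by_hand = W, wilcox_test = unname(res$statistic)))
```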

## Kruskal-Wallis Test

The Kruskal-Wallis test is a non-parametric extension of the Mann-Whitney U test to three or more independent samples, and an alternative to one-way ANOVA. It assumes the samples come from distributions of the same shape. It tests for differences between the scores of k independent samples, possibly of unequal sizes, where the i<sup>th</sup> sample contains l<sub>i</sub> observations.
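
As a sketch of what the test computes, the statistic H can be built from the rank sums of each group and compared with a chi-squared distribution on k - 1 degrees of freedom. The simulated data below have no ties, so no tie correction is applied; the data and variable names are assumptions for this example.

```r
set.seed(99)
values <- c(rnorm(6, mean = 10), rnorm(8, mean = 11), rnorm(5, mean = 12))
group  <- factor(rep(c("A", "B", "C"), times = c(6, 8, 5)))

N <- length(values)
r <- rank(values)                      # ranks in the pooled sample
rank_sums <- tapply(r, group, sum)     # rank sum per group
n_i       <- tapply(r, group, length)  # group sizes

# Kruskal-Wallis statistic (no tie correction needed here)
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_i) - 3 * (N + 1)
p <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

res <- kruskal.test(values ~ group)
print(c(H_by_hand = H, H_kruskal_test = unname(res$statistic)))
print(c(p_by_hand = p, p_kruskal_test = res$p.value))
```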

data('airquality')

kruskal.test(airquality$Ozone ~ airquality$Month)

## Q1

x = c(12.3,15.4,10.3,8,14.6,15.7,10.8,45,12.3,8.2,20.1,26.3,32.4,41.2,35.1,25,8.2,18.4,32.5)
y = c(rep('A',5),rep('B',7),rep('C',7))

kruskal.test(x~y)

## Q2

x = c(166.7,172.2,165,176.9,166.2,157.3,166.7,161.1,158.6,176.4,153.1,156,162.8,142.4,162.7,162.4)
y = c(rep('0',4),rep('1',4),rep('3',4),rep('9',4))

kruskal.test(x~y)

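## Friedman Test

It is a non-parametric test for three or more related samples, i.e. the same blocks (or subjects) are observed under every treatment. It is an alternative to one-way repeated-measures ANOVA, or equivalently to two-way ANOVA for a randomized block design without replication.
<br>H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub> = ... = μ<sub>k</sub>
<br>H<sub>a</sub>: at least one μ<sub>j</sub> differs from the others
<br>
where μ<sub>j</sub> is the median of treatment j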

obs = c(45,49,38,48,45,39,43,42,35,41,39,36)
# obs = c(45,48,43,41,49,45,42,39,38,39,35,36)
soyabean_var = c(rep('A',3),rep('B',3),rep('C',3),rep('D',3))
block = c(rep(c('1','2','3'),4))

# fill column-wise so that each column holds one variety, matching soyabean_var and block
df = data.frame(matrix(obs,nrow=3,ncol=4),row.names = c(1,2,3))

colnames(df) = c('A','B','C','D')

df

friedman.test(obs~soyabean_var|block)
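
The Friedman statistic can be reproduced by ranking the observations within each block and summing the ranks per treatment. The sketch below does this for the soybean data arranged as a blocks-by-varieties matrix. The names `yields`, `R` and `Q` are illustrative, and the formula without tie correction applies because no block contains tied values.

```r
# Rows are blocks 1-3, columns are varieties A-D (filled column-wise)
yields <- matrix(c(45, 49, 38,  48, 45, 39,  43, 42, 35,  41, 39, 36),
                 nrow = 3, ncol = 4,
                 dimnames = list(block = c("1", "2", "3"),
                                 variety = c("A", "B", "C", "D")))

n <- nrow(yields)  # number of blocks
k <- ncol(yields)  # number of treatments

ranks <- t(apply(yields, 1, rank))  # rank within each block
R <- colSums(ranks)                 # rank sum per variety

# Friedman chi-squared statistic (no ties within blocks)
Q <- 12 / (n * k * (k + 1)) * sum(R^2) - 3 * n * (k + 1)
p <- pchisq(Q, df = k - 1, lower.tail = FALSE)

res <- friedman.test(yields)
print(c(Q_by_hand = Q, Q_friedman_test = unname(res$statistic)))
print(c(p_by_hand = p, p_friedman_test = res$p.value))
```

The rank sums are 10, 11, 5 and 4 for varieties A to D, which gives Q = 7.4 on 3 degrees of freedom.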

obs = c(5,7,3,4,3,4,5,5,8,6,7,7,9,2,6,8,2,3,4,3,7,9,10,9,6,8,8,6,1,5,2,1,4,1,1,2,10,10,9,10)
students = c(rep('1',4),rep('2',4),rep('3',4),rep('4',4),rep('5',4),rep('6',4),rep('7',4),
            rep('8',4),rep('9',4),rep('10',4))
prof = c(rep(c(1,2,3,4),10))

friedman.test(obs~students|prof)

## Q3

reaction_time = c(1.21,1.63,1.42,2.43,1.16,1.94,1.48,1.85,2.06,1.98,1.27,2.44,1.56,2.01,1.7,2.64,1.48,2.81)
lbl = c(rep('A',6),rep('B',6),rep('C',6))

sub = c(rep(c('1','2','3','4','5','6'),3))

kruskal.test(reaction_time~lbl)

friedman.test(reaction_time~lbl|sub)