# Data structure and manipulation in R

![WhatsApp%20Image%202023-03-30%20at%2009.44.03.jpeg](attachment:WhatsApp%20Image%202023-03-30%20at%2009.44.03.jpeg)

# Data Manipulation
• Data manipulation is required to make the data accurate and consistent before analysis.
• The R base package has the 'apply' family of functions, which apply an operation repeatedly across the data and thus avoid explicit loop constructs.


**Apply Functions**
• The apply functions are used to perform a specific change to each column or row of R objects.
• Types of apply functions in R:
• apply()
• lapply()
• sapply()
• tapply()
• mapply()
• vapply()
• rapply()

apply(), lapply(), and sapply() are the most commonly used functions.


**Types of Apply Functions**
1- apply() applies a function to the rows or columns of a matrix and returns a vector, array, or list.
Syntax:
apply(x, margin, function)
Where,
• x is the matrix (or array) to operate on.
• margin indicates whether the function is to be applied to rows or columns.
   • margin = 1 indicates that the function is applied to each row.
   • margin = 2 indicates that the function is applied to each column.
• function can be any function, such as mean, sum, or median.
##Examples:
m <- matrix(c(1, 2, 3, 4), 2, 2)
apply(m, 1, sum)   # row sums
apply(m, 2, sum)   # column sums


2- lapply() takes a list as an argument and applies the function to each element in the list.
The output of the function is a list.
Syntax:
lapply(list, function)
##Examples:
my_list <- list(a = c(1, 1), b = c(2, 2), c = c(3, 3))
lapply(my_list, sum)
lapply(my_list, mean)


3- sapply() is similar to lapply(), except that it simplifies the result so that:
• If the result is a list and every element in the list is of size 1, then a vector is returned.
• If the result is a list and every element in the list is of the same size (>1), then a matrix is returned.
• Otherwise, the result is returned as a list itself.
Syntax:
sapply(list, func)
##Examples:
my_list <- list(a = c(1, 1), b = c(2, 2), c = c(3, 3))
sapply(my_list, sum)     # every result has length 1, so a vector is returned
my_list <- list(a = c(1, 2), b = c(1, 2, 3), c = c(1, 2, 3, 4))
sapply(my_list, range)   # every result has length 2, so a matrix is returned
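
The difference between the three functions is easiest to see by running them on the same inputs. The following is a minimal sketch in base R; the variable names (`scores`, `row_sums`, and so on) are illustrative and not part of the original notes.

```r
# A 2 x 2 matrix, filled column by column: the columns are (1, 2) and (3, 4)
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
row_sums <- apply(m, 1, sum)   # margin = 1: one sum per row
col_sums <- apply(m, 2, sum)   # margin = 2: one sum per column

scores <- list(a = c(1, 1), b = c(2, 2), c = c(3, 3))
as_list   <- lapply(scores, sum)     # always returns a list
as_vector <- sapply(scores, sum)     # length-1 results are simplified to a named vector
as_matrix <- sapply(scores, range)   # length-2 results are simplified to a 2 x 3 matrix

print(row_sums)
print(col_sums)
str(as_list)
print(as_vector)
print(as_matrix)
```

Comparing `str(as_list)` with `print(as_vector)` shows that the values are the same and only the container differs.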

-----------------------------------------------------------------------------------------------------------------------------

# dplyr Package
• There are packages available consisting of many functions which help in data manipulation.
• dplyr is one of the most commonly used of these packages and is a powerful tool for data manipulation.

**Features of dplyr Package**
• dplyr package transforms and summarizes tabular data with rows and columns.
• It provides simple verbs (functions that correspond to the most common data manipulation tasks) to help you translate your thoughts into code.

##The dplyr package has the following functions:
• select()
• filter()
• arrange()
• mutate()
• summarize()
• To understand the use of these functions, let's consider the built-in dataset "mtcars".
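1- select():
This function allows you to select specific columns from large data sets.
##Examples:
Different ways to select columns by name:
select(mtcars, mpg, disp)            # the columns mpg and disp
select(mtcars, mpg:hp)               # all columns from mpg through hp
select(iris, starts_with("Petal"))   # names that start with "Petal"
select(iris, ends_with("Width"))     # names that end with "Width"
select(iris, contains("etal"))       # names that contain "etal"
select(iris, matches(".t."))         # names that match the regular expression ".t."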

2- filter():
• This function keeps only the rows that satisfy the given conditions, which makes it easy to zoom in on the relevant data.
• The two types of filters are shown below:
##Examples:
Simple filter
filter(mtcars, cyl == 8)
filter(mtcars, cyl < 6)
Multiple criteria filter
filter(mtcars, cyl < 6 & vs == 1)   # both conditions must hold
filter(mtcars, cyl < 6 | vs == 1)   # at least one condition must hold
Comma-separated arguments are equivalent to the "And" condition.
Example: filter(mtcars, cyl < 6, vs == 1)


3- arrange():
This function helps arrange the data in a specific order.
##Examples:
arrange(mtcars, desc(disp))   # sort by disp in descending order
arrange(mtcars, cyl, disp)    # sort by cyl, then by disp within each value of cyl


4- mutate():
This function helps add new variables to an existing data set.
##Example:
mutate(mtcars, my_custom_disp = disp / 1.0237)


5- summarize():
This function summarizes multiple values to a single value in a dataset.
##Examples:
Here are examples that use this function together with the group_by() function:
summarise(group_by(mtcars, cyl), mean(disp))
summarise(group_by(mtcars, cyl), m = mean(disp), sd = sd(disp))
• Here's a list of summary functions that can be used within this function:
• first(x): Returns the first element of a vector
• last(x): Returns the last element of a vector
• nth(x, n): Returns the 'n'th element of a vector
• n(): Returns the number of rows in the data frame or in the current group
• n_distinct(x): Returns the number of unique values in vector x
• In addition, the following functions are also used:
   mean - max - var - median - min - length - mode - sum - IQR
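
The five verbs are usually chained together. The following is a minimal sketch that assumes the dplyr package is installed; the `%>%` pipe (re-exported by dplyr) and the column name `disp_litres` are choices made for this illustration and do not appear in the notes above.

```r
library(dplyr)  # assumes the dplyr package is installed

# Keep cars with fewer than 8 cylinders, add a derived column,
# then summarise each cylinder group and sort the groups.
result <- mtcars %>%
  filter(cyl < 8) %>%
  select(mpg, cyl, disp) %>%
  mutate(disp_litres = disp * 0.0163871) %>%   # cubic inches to litres
  group_by(cyl) %>%
  summarise(n_cars = n(),
            mean_disp = mean(disp),
            max_mpg = max(mpg)) %>%
  arrange(desc(mean_disp))

print(result)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes this kind of chaining possible.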

------------------------------------------------------------------------------------------------------------------------------
# Data structure
**Identifying Data Structures**
A data scientist has to work on a dataset that is a blend of character and numeric values.
The data is related to direct marketing campaigns of a banking institution.
The data scientist has to identify the data structures.

**Types of Data Structures**
• Atomic Vectors
• Matrix
• Arrays
• Factors
• Data Frames
• Lists

1- Atomic Vectors:
• An atomic vector is a one-dimensional object and is the simplest data structure.
• It is called an atomic vector as all elements in it are of the same type.
• The data types in atomic vectors:
• Numeric Data Type
• Integer Data Type
• Character Data Type
• Logical Data Type
##Example
a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("one", "two", "three")
d <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

.Vectors of consecutive numbers can be created using the ':' operator.
##Example  (prints the consecutive numbers, for example from 1 to 5, and likewise for the second vector)
> x <- 1:5; x
[1] 1 2 3 4 5
> y <- 1:-1; y
[1]  1  0 -1


##ACCESSING ELEMENTS OF ATOMIC VECTORS
• Vector elements can be accessed by vector indexing. The index vector can be numeric, character, or logical.
• An individual element of a vector is accessed by its position, which is indicated within square brackets.
##Example
vec <- c("a", "b", "c", "d", "e", "f")
vec[1]         # returns the first element of the vector
vec[c(2, 4)]   # returns the second and fourth elements of the vector



**To write a comment in R, place the special character # at the beginning of the comment.**


2- Matrix:
• A matrix is a two-dimensional data structure.
• It is similar to a vector but has the dimension attribute.
##Example
vector <- c(1, 2, 3, 4)
foo <- matrix(vector, nrow = 2, ncol = 2)
Elements in a matrix must be of the same type, whether numeric, character, or logical.


The numbers of rows and columns are set using the nrow and ncol arguments respectively.
##Example
> matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

#byrow:
By default, a matrix is filled column-wise. Assigning TRUE to the byrow argument switches this to row-wise filling.
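
The effect of the fill order, and of bracket indexing on the result, can be checked directly. This is a small base R sketch; the object names (`x_by_row`, `sub`, `second_row`) are illustrative.

```r
x <- matrix(1:9, nrow = 3, ncol = 3)                        # default: filled column-wise
x_by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)   # filled row-wise

sub <- x[c(1, 2), c(2, 3)]   # rows 1 and 2, columns 2 and 3
second_row <- x[2, ]         # an empty index selects everything along that dimension

print(x)
print(x_by_row)
print(sub)
print(second_row)
```

For a square matrix filled from the same vector, the row-wise version is simply the transpose of the column-wise one.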

.Elements are accessed using the square bracket '[ ]' indexing method.
##Example
> x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> x[c(1, 2), c(2, 3)]  # select rows 1 & 2 and columns 2 & 3


3- Arrays:
Arrays are similar to a matrix but can have more than two dimensions.
##Example
A <- array(1:24, dim = c(3, 4, 2))


• Arrays take vectors as input and arrange their values according to the dim argument.
• Rows, columns, and matrices are named using the 'dimnames' parameter.
##Example
vector1 <- c(4, 2, 1)
vector2 <- c(22, 34, 76, 88, 98, 65)
column.names <- c("COL1", "COL2", "COL3")
row.names <- c("ROW1", "ROW2", "ROW3")
matrix.names <- c("Matrix1", "Matrix2")
result <- array(c(vector1, vector2), dim = c(3, 3, 2),
dimnames = list(row.names, column.names, matrix.names))   # the 9 values fill the first 3 x 3 matrix and are recycled for the second
print(result)

.Using the index position, one can access or change the individual elements in an array.
##Example
print(result[2, , 1])  # prints the second row of the first matrix of the array
##Output:
COL1 COL2 COL3
   2   34   98


4- Factors:
Factors take only a predefined, finite number of categorical values.
##Example
> x
[1] male   female female male
Levels: female male

• Factors are created using the factor() function.
• They are built using two attributes: class and levels.
##Example
> x <- factor(c("male", "female", "female", "male"))
> x
[1] male   female female male
Levels: female male
> x <- factor(c("male", "female", "female", "male"), levels = c("male", "female"))
> x
[1] male   female female male
Levels: male female

.Accessing elements of factors is similar to accessing elements of an atomic vector.
##Example
> x
[1] single  married married single
Levels: married single
> x[3]  # access the 3rd element
[1] married
Levels: married single
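
The two attributes of a factor, and the integer codes stored underneath the labels, can be inspected as follows. This is a minimal base R sketch; the variable names `status` and `status_ordered` are illustrative.

```r
status <- factor(c("single", "married", "married", "single"))

levels(status)       # levels are sorted alphabetically by default
class(status)        # "factor"
as.integer(status)   # the integer codes that point into the levels

# Passing levels explicitly fixes their order
status_ordered <- factor(status, levels = c("single", "married"))
levels(status_ordered)

table(status)        # frequency of each level
```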


5- Data frames:
• Data frames are the most commonly used data structures in R.
• A data frame is similar to a general matrix, but its columns can contain different modes of data, such as numeric and character.
##Example:
name <- c("Joe", "John", "Nancy")
sex <- c("M", "M", "F")
age <- c(27, 26, 26)
df <- data.frame(name, sex, age)

• Data frames are created using the data.frame() function.
• When the argument stringsAsFactors = FALSE is passed, the data.frame() function will not convert character vectors into factors. (Since R 4.0.0 this is the default.)
##Example
df <- data.frame(
Name = c("Joe", "John", "Nancy"),
Sex = c("M", "M", "F"),
Age = c(27, 26, 26),
stringsAsFactors = FALSE
)

The data can be accessed using column names.
##Example
result <- data.frame(Name = df$Name, Age = df$Age, Sex = df$Sex)
print(result)
##Output:
   Name Age Sex
1   Joe  27   M
2  John  26   M
3 Nancy  26   F
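
A few common ways of inspecting and subsetting a data frame are sketched below in base R. The object names `ages`, `older`, and `name_sex` are illustrative.

```r
df <- data.frame(
  Name = c("Joe", "John", "Nancy"),
  Sex  = c("M", "M", "F"),
  Age  = c(27, 26, 26),
  stringsAsFactors = FALSE
)

str(df)                              # column names, types, and first values

ages     <- df$Age                   # one column as a vector
older    <- df[df$Age > 26, ]        # rows that satisfy a condition
name_sex <- df[, c("Name", "Sex")]   # a subset of columns by name

print(older)
print(name_sex)
```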


6- Lists:
• Lists are the most complex data structures.
• A list is a vector that can have elements of different types.
• A list may contain a combination of vectors, matrices, data frames, and even other lists.
##Example
vec <- c(1, 2, 3, 4)
mat <- matrix(vec, 2, 2)
list_data <- list(vec, mat)

.CREATING LISTS
Lists are created using the list() function.
##Example
vec <- c(1, 2, 3, 4)
mat <- matrix(vec, 2, 2)
list_data <- list(vec, mat)
print(list_data)

.ACCESSING ELEMENTS OF LISTS
• List elements can be accessed by indexing.
• The index vector can be an integer, character, or logical vector.
##Example
print(list_data[2])  # the second element, which holds mat
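
Single brackets, double brackets, and `$` behave differently on a list. The sketch below uses a named list so that character indexing works; the element names `numbers` and `grid` are assumptions made for this illustration.

```r
vec <- c(1, 2, 3, 4)
mat <- matrix(vec, 2, 2)
list_data <- list(numbers = vec, grid = mat)   # element names are illustrative

sub_list <- list_data[2]          # single brackets: still a list, of length 1
element  <- list_data[[2]]        # double brackets: the matrix itself
by_name  <- list_data$numbers     # $ extracts an element by name
by_chr   <- list_data[["grid"]]   # a character index works the same way

str(sub_list)
print(element)
```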

-------------------------------------------------------------------------------------------------------------------------------

# Assigning Values to Data Structures
Now that data structures are identified, the next step is to assign values to the data structure. This is achieved by importing and exporting data from files.

1- Importing Data
You can import data from four types of files in R:
• Excel
• Minitab
• Table
• CSV

Before using the sample data available in an Excel format, you need to import the data into R.
1.1- Excel:
##Example 1:
library(gdata)                    # load the gdata package
help(read.xls)                    # documentation
mydata = read.xls("mydata.xls")   # read from the first sheet

##Example 2:
library(XLConnect)
wk = loadWorkbook("mydata.xls")
df = readWorksheet(wk, sheet = "Sheet1")


1.2- Minitab:
• Use the function read.mtp to import the sample data from a Minitab Portable Worksheet format.
• This function returns a list of components in the Minitab worksheet.
##Example:
library (foreign)
help (read.mtp)
mydata = read.mtp ("mydata.mtp")


1.3-table:
• A text file can have a data table in it. The cells inside the table are separated by blank characters.
• Here's an example of a table with four rows and three columns. Let's see how to import this data.
##Example:
help (read.table)
mydata = read.table ("mydata.txt")


1.4- csv:
• R allows data import from a Comma Separated Values (CSV) format as well.
• Each cell inside such a data file is separated by a special character, usually a comma.
##Example:
help (read.csv)
mydata = read.csv ("mydata.csv", sep=",")

2- Exporting Data:
R supports data export to three types of files:
• Table
• Excel
• CSV


2.1-table:
##Example:
help (write.table)
write.table(mydata, "c:/mydata.txt", sep = "\t")

2.2-excel:
##Example:
library(xlsx)  # load the xlsx package
help (write.xlsx)
write.xlsx (mydata, "c:/mydata.xlsx")

2.3-csv:
##Example:
help (write.csv)
write.csv(mydata, file = "mydata.csv")
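
Export and import can be checked together with a round trip. The sketch below uses only base R and writes to temporary files so that nothing is left behind; the small `mydata` data frame is made up for the illustration.

```r
mydata <- data.frame(name = c("Joe", "John", "Nancy"),
                     age  = c(27, 26, 26),
                     stringsAsFactors = FALSE)

# CSV round trip
csv_path <- tempfile(fileext = ".csv")
write.csv(mydata, file = csv_path, row.names = FALSE)   # row.names = FALSE omits the index column
restored_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Tab-separated table round trip
tab_path <- tempfile(fileext = ".txt")
write.table(mydata, tab_path, sep = "\t", row.names = FALSE, quote = FALSE)
restored_tab <- read.table(tab_path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

print(restored_csv)
print(restored_tab)
```

Without `row.names = FALSE`, the row names are written as an extra first column, which then shows up as an unnamed column on import.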

-------------------------------------------------------------------------------------------------------------------------------

# Coding in R

BankCustomer <- read.csv("Demo 2 Assigning values and applying functions.csv")
View(BankCustomer)   # shows the table
install.packages("plyr")   # a prompt asks whether to install the package; accept it, and the package then appears in the Packages pane (user library, lower right)
library(plyr) # load the library after installing it
BankCustomer <- rename(BankCustomer, c("i..age" = "age"))  # rename the column i..age to age
str(BankCustomer) # shows every column with its type and first values
View(BankCustomer)
max(BankCustomer) # wrong: raises an error, because max() does not work on a whole data frame with non-numeric columns
max(BankCustomer$age)  # correct: pass a single numeric column
min(BankCustomer$age)  # correct
BankCustomerAgeCategorized <- transform(BankCustomer, Generation = ifelse(age < 22, "Z", ifelse(age < 41, "Y", ifelse(age < 53, "X",
"Baby Boomers"))))
BankCustomerAgeCategorized
#2Way Frequency Table   # cross-tabulates the two variables: the first gives the rows, the second the columns
table(BankCustomerAgeCategorized$Generation, BankCustomerAgeCategorized$poutcome)
table(BankCustomerAgeCategorized)  # tabulates the whole data frame
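
The age categorisation above can be tried without the course file. The sketch below uses a small made-up data frame (`customers`, with hypothetical `age` and `poutcome` values) and also shows `cut()`, a base R alternative to nested `ifelse()` calls that is not used in the original demo.

```r
# Hypothetical data standing in for the bank customer file
customers <- data.frame(
  age      = c(19, 25, 38, 45, 52, 60, 33, 71),
  poutcome = c("success", "failure", "failure", "success",
               "failure", "success", "failure", "failure"),
  stringsAsFactors = FALSE
)

# Nested ifelse(), as in the demo
gen_ifelse <- with(customers,
  ifelse(age < 22, "Z",
  ifelse(age < 41, "Y",
  ifelse(age < 53, "X", "Baby Boomers"))))

# The same bins with cut(); right = FALSE makes each interval [lower, upper)
gen_cut <- cut(customers$age,
               breaks = c(-Inf, 22, 41, 53, Inf),
               labels = c("Z", "Y", "X", "Baby Boomers"),
               right = FALSE)

customers$Generation <- gen_cut
freq <- table(customers$Generation, customers$poutcome)   # rows: Generation, columns: poutcome
print(freq)
```

`cut()` returns a factor whose levels follow the order of `labels`, so the rows of the frequency table appear in generation order rather than alphabetically.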

           ##########################################################################
           
install.packages("readxl")  # downloads the package; it then appears in the user library, where it can be ticked to load it
library(readxl)  # load the library
setwd("C:/Users/Matt/Desktop/") # set the working directory to a folder on the machine
getwd()  # prints the current working directory
BankCustomer <- read_excel("Demo 1 Identifying Data_Structures.xlsx") # then read the file from that folder
Or another way:
setwd(choose.dir())  # opens a dialog for picking a folder (Windows only)
BankCustomer1 <- read.csv("Demo 1_Identifying Data Structures.csv") # then read the file
View(BankCustomer)
str(BankCustomer)
BankCustomer <- read.csv("Demo 1 Identifying Data Structures.csv", stringsAsFactors=TRUE)
str(BankCustomer) # stringsAsFactors=TRUE converts the text columns to factors; numeric columns are unchanged, and str() now lists the categories as levels
BankCustomer2 <- read.csv("Demo 1 Identifying Data Structures.csv", stringsAsFactors=FALSE)
str(BankCustomer2)     # text columns stay as character vectors instead of factors

_______________________________________________________________________________________________________________________________

# Data visualization

![WhatsApp%20Image%202023-03-30%20at%2021.20.12%20%281%29.jpeg](attachment:WhatsApp%20Image%202023-03-30%20at%2021.20.12%20%281%29.jpeg)

The team separated people into groups based on gender and age.
The team used data visualization to simplify the information about causes of death for different age groups. For example, the table can be used to identify the three most likely causes of death for people aged 24 to 36.

# What Is Data Visualization?
Data visualization is a modern equivalent of visual communication that involves the creation and study of the visual representation of data.

##Solving Complex Challenges Using Data Visualization
GE specializes in solving complex challenges related to infrastructure, renewable energy, and affordable health care.

The marketing communications brand group was given the task of analyzing the causes of death of people.

# Data Visualization in R graphics:

**1- Bar charts:**
Bar plots are horizontal or vertical bars used to show comparisons between categorical values. They represent length, frequency, or proportion of categorical values.
Syntax: barplot (x)

##CREATING BAR CHARTS IN R
Use the mtcars dataset (inbuilt in R) to create simple and horizontal bar plots:
*simple bar chart
counts <- table (mtcars$gear)
barplot (counts)

*horizontal bar chart
barplot (counts, horiz=TRUE)

##EDITING BAR CHARTS IN R
Titles, legends, and colors can be added to a simple bar chart using the following code:
*Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Simple Bar Plot", xlab="Improvement", ylab="Frequency", legend=rownames(counts), col=c("red", "yellow", "green"))   # main is the chart title; legend shows which category each bar represents; this draws separate bars in different colors

A stacked bar plot with colors and legends can be created using the following code (both colors are drawn on the same bar):
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS", xlab="Number of Gears", col=c("grey", "cornflowerblue"),
legend = rownames(counts))

A grouped bar plot can be created using the following code (one bar per group, drawn side by side):
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS", xlab="Number of Gears", col=c("grey", "cornflowerblue"), legend = rownames(counts), beside=TRUE)


**2- pie chart:**
A pie chart is a graph in which a circle is divided into sectors, each representing a proportion of the whole.
Syntax: pie (attributes)
##CREATING PIE CHARTS IN R
Consider a pie chart that contains 10, 12, 4, 16, and 8 as slices and US, UK, Australia, Germany, and France as labels. Use the pie(x, labels =) function to create the pie chart:
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Simple Pie Chart")

##EDITING PIE CHARTS IN R
Percentages can be added to a pie chart using the following code:
slices <- c(10, 12, 4, 16, 8)
pct <- round(slices / sum(slices) * 100)
lbls <- paste(c("US", "UK", "Australia", "Germany", "France"), " ", pct, "%", sep="")
pie(slices, labels=lbls, col=rainbow(5), main="Pie Chart with Percentages")

##EDITING PIE CHARTS IN R
A 3-dimensional pie chart can be created as shown:

library(plotrix)
slices <- c(10, 12, 4, 16, 8)
pct <- round(slices / sum(slices) * 100)
lbls <- paste(c("US", "UK", "Australia", "Germany", "France"), " ", pct, "%", sep="")
pie3D(slices, labels=lbls, explode=0.0, main="3D Pie Chart")


**3- histogram:**
A histogram represents the distribution of a continuous variable and the frequency of values bucketed into ranges.
Syntax: hist(x)

##CREATING HISTOGRAMS IN R
Creating a simple histogram using the mtcars dataset:
The first step is to "bin" the range of values, i.e., divide the entire range of values into a series of intervals and then count how many values fall into each interval.
Next, use the following code:
mtcars$mpg     #miles per gallon data
hist (mtcars$mpg)

##EDITING HISTOGRAMS IN R
To color histograms with a different number of bins, use the following code:
#Colored Histogram with Different Number of Bins
hist(mtcars$mpg, breaks=8, col="darkgreen")
...The breaks argument controls the number of bins. R treats a single number as a suggestion and may choose a nearby value that gives round break points.
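
The binning step can be inspected without drawing anything by passing `plot = FALSE`, which returns the computed break points and counts. This is a minimal base R sketch; the object names `h` and `h_exact` are illustrative.

```r
# Let R choose about 8 bins and return the result instead of plotting it
h <- hist(mtcars$mpg, breaks = 8, plot = FALSE)
h$breaks   # bin boundaries actually used
h$counts   # number of cars in each bin

# Supplying a vector of break points fixes the bins exactly
h_exact <- hist(mtcars$mpg, breaks = seq(10, 35, by = 5), plot = FALSE)
h_exact$counts
```

Every observation falls into exactly one bin, so the counts always add up to the number of rows in `mtcars`.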


**4- Kernel density plot:**
A Kernel density plot shows the distribution of a continuous variable.
Syntax: plot (density (x) )
A histogram is not a great method for determining the shape of a distribution because its appearance depends on the number of bins used. To address this, Kernel density plots are often preferred over histograms.

##CREATING A KERNEL DENSITY PLOT IN R
The plot can be created using plot(density(x)), where x is a numeric vector. Use the mtcars dataset in R.
# kernel Density Plot
density_data <- density(mtcars$mpg)
plot(density_data)

##EDITING A KERNEL DENSITY PLOT IN R
To add color and a border to the plot, use the following code:
#Filling density Plot with color
density_data <- density(mtcars$mpg)
plot(density_data, main="Kernel Density of Miles Per Gallon")
polygon(density_data, col="skyblue", border="black")


**5- Line chart:**
A Line chart is used to represent a series of data points connected by a straight line.
It helps visualize data that changes over time.
Syntax: lines (x, y, type=)
##CREATING A LINE CHART IN R
To create a line chart with the plot() function by plotting body weight against months, use the following code:
weight <- c(2.5, 2.8, 3.2, 4.8, 5.1, 5.9, 6.8, 7.1, 7.8, 8.1)
months <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
plot(months, weight, type = "b", main="Baby Weight Chart")

##EDITING A LINE CHART IN R
To change the color of the plot, use the following code:
plot(months, weight, type = "b", col = "red")


**6- Boxplot:**
Box plot, also called whisker diagram, displays the distribution of data based on the five-number summary:
• Minimum
• First quartile
• Median
• Third quartile
• Maximum
Syntax: boxplot (data)
##CREATING A BOX PLOT IN R
Use the following code to create a box plot using the inbuilt R dataset "airquality":
boxplot(airquality$Ozone, main = "Mean ozone in parts per billion at Roosevelt Island", xlab = "Parts Per Billion", ylab = "Ozone", horizontal = TRUE)

##EDITING BOX PLOT IN R
To change the color of the plot, use the following code:
boxplot(airquality$Ozone, main = "Mean ozone in parts per billion at Roosevelt Island", xlab = "Parts Per Billion", ylab = "Ozone", col = "green", horizontal = TRUE)


**7- heat map:**
A heat map is a two-dimensional representation of data that uses colors to represent the values. The two types of heat maps are:
• Simple Heat Map: Provides an immediate visual summary of information
• Elaborate Heat Map: Helps in understanding complex data sets
Syntax: heatmap (data, Rowv=NA, Colv=NA)
##CREATING HEAT MAP IN R
To generate a simple heatmap, use the following code:
mat<-as.matrix (mtcars);
heatmap (mat);
..Certain variables with relatively high values absorb all the variance.

##EDITING HEAT MAP IN R - NORMALIZATION
The scale argument of the heatmap is used to normalize the data matrix, as shown below:
heatmap (mat, scale="column")
..To adjust for the variation between columns, set the value of scale to "column" in the heatmap.

##EDITING HEAT MAP IN R - DENDROGRAM AND REORDERING
A clustering algorithm sorts the order of rows and columns differently in the heatmap based on similarity.
The raw data matrix can be visualized and normalized without reordering columns or utilizing the dendrograms with the following code:
heatmap (mat, Colv = NA, Rowv = NA, scale="column");


**8- wordcloud:**
A word cloud (also called a tag cloud) highlights the most commonly cited words in a text using a quick visualization.
Syntax: wordcloud(words = data, freq = freq, min.freq = 2)
##CREATING WORD CLOUD IN R
To create a word cloud, load the .csv data followed by the required library as shown below:
install.packages ("wordcloud")
library ("wordcloud")
data <- read.csv ("TEXT.csv",header = TRUE)
head (data)
wordcloud (words = data$word, freq = data$freq, min.freq = 2, max.words=100, random.order=FALSE)

##EDITING WORD CLOUD IN R
For an attractive and colorful word cloud, use the code below:
install.packages("wordcloud")
library("wordcloud")
data <- read.csv("TEXT.csv", header = TRUE)
head(data)
wordcloud(words = data$word, freq = data$freq, min.freq = 2, max.words=100, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

### Every problem goes through 3 stages before it is solved:
problem statement - study - outcome

..The available data should be visualized so that it can be interpreted easily. Graphics in R can be used to visualize the data.

..A scatter plot shows the correlation between two variables.

##GRAPHICS LIMITATIONS
• Plots cannot be saved as objects
• Multivariate exploration is complex
• Layers are not supported
• Merging graphics is not supported


## What Is ggplot2?
ggplot2 is a data visualization package of R that provides a general scheme for data visualization. It breaks up graphs into semantic components such as scales and layers. It is an alternative for the basic graphics of R.

##Example1:

Creating a bar plot of a single variable (in ggplot, the frequency need not be calculated beforehand):
library("ggplot2")
ggplot(hsb, aes(x=read)) + geom_bar()  # shows the distribution of the data, with one bar for each value


![WhatsApp%20Image%202023-03-31%20at%2005.01.18.jpeg](attachment:WhatsApp%20Image%202023-03-31%20at%2005.01.18.jpeg)

##Example2:

Creating a Kernel density plot of a single variable as a curve:
ggplot(hsb, aes(x=read)) + geom_density()


![WhatsApp%20Image%202023-03-31%20at%2005.01.17.jpeg](attachment:WhatsApp%20Image%202023-03-31%20at%2005.01.17.jpeg)

##Example3:

Creating a histogram using the "airquality" dataset:
ggplot(airquality, aes(x = Ozone)) +
geom_histogram(aes(y = ..count..), binwidth = 5, colour = "black", fill = "blue") +
scale_x_continuous(name = "Mean ozone in \nparts per billion", breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_y_continuous(name = "Count") +
ggtitle("Frequency histogram of mean ozone")  # shows the distribution of the data split into bins


![WhatsApp%20Image%202023-03-31%20at%2005.01.18%20%281%29.jpeg](attachment:WhatsApp%20Image%202023-03-31%20at%2005.01.18%20%281%29.jpeg)

##Example4
Creating a box plot using the "airquality" dataset:
airquality$Month <- factor(airquality$Month, labels = c("May", "Jun", "Jul", "Aug", "Sep"))
ggplot(airquality, aes(x = Month, y = Ozone)) +
geom_boxplot(fill = "blue", colour = "black") +
scale_y_continuous(name = "Mean ozone in \nparts per billion", breaks = seq(0, 175, 25), limits=c(0, 175)) +
scale_x_discrete(name = "Month") + ggtitle("Boxplot of mean ozone by month")  # draws one box per month side by side, so the distributions can be compared


![WhatsApp%20Image%202023-03-31%20at%2005.01.18%20%282%29.jpeg](attachment:WhatsApp%20Image%202023-03-31%20at%2005.01.18%20%282%29.jpeg)

###Saving a Graphic Output as a File:

##Example:

To save a graphic output as a file, the following code can be used:
jpeg ("myplot.jpg" )
counts <- table (mtcars$gear)
barplot (counts)
dev.off ()

..The dev.off() function closes the graphics device, which finishes writing the file, and returns control to the terminal.


![WhatsApp%20Image%202023-03-31%20at%2004.53.07.jpeg](attachment:WhatsApp%20Image%202023-03-31%20at%2004.53.07.jpeg)

# Statistics for data science

# Key Takeaways

• The null hypothesis is tested for possible rejection under the assumption that it is true, and always refers to a specified value of the population parameter, such as μ.

• Data sampling is a statistical analysis technique used to select, manipulate, and analyze a subset of data points to discover hidden patterns and trends in the larger data set.

• The confidence level is the proportion of possible confidence intervals that contain the true value of their corresponding parameter.

• Level of significance refers to the probability of a Type I error (α), that is, the probability that a random value of the statistic t belongs to the critical region.
It is usually set at 5% or 1% when employed in hypothesis testing.

• Power of test is the complement of the probability of a Type II error (1 − β) and refers to the probability of rejecting H0 when it is false.

• A hypothesis test is a formal procedure in statistics used to test whether a hypothesis can be accepted or not.

• The Z-test is performed in cases where the test statistic is t and σ is known.

• The T-test is performed in cases where the test statistic is t and σ is unknown.

• The degree of freedom is the number of independent variates that make up the statistic.

• The Chi-Square Test considers the square of a standard normal variate.

• The ANOVA test is used for such hypothesis tests that compare the averages of two or more groups.

• Both parametric and non-parametric tests of the population have a pre-determined value, or the values need to be defined.

**Consider a scenario where a marketing manager must decide whether to launch a new product or not.
On analysis, the manager could arrive at the following decision:
The product will be launched if the company gets a market share of 15% or more.
The product will not be launched if the company gets a market share of less than 15%.
Businesses analyze data to make optimal decisions that maximize profit at minimum risk.
Prediction of such outcomes depends on the acceptance or rejection of a hypothesis.**

# What Is Hypothesis?
Hypothesis literally means assumption. Assumption is a subjective term.
A hypothesis is an assertion or a statement about the state of nature and the true value of an unknown population parameter.

##Hypothesis: Example:
• Eating more vegetables leads to weight loss
• Brushing teeth every day reduces cavities
These statements have no supporting data and are hence considered hypotheses.

A hypothesis needs analysis to be validated.

In statistics, most hypotheses are written as "if...then" statements. For example, If I eat more vegetables, then I will lose weight faster.

# Types of Hypothesis
• Simple Hypothesis
• Complex Hypothesis
• Null Hypothesis
• Alternate Hypothesis
• Statistical Hypothesis


1- Simple Hypothesis:
In a simple hypothesis, there exists a relationship between two variables;
one is called an independent variable or cause and the other is called a dependent variable or effect.
#Example:
Given total Population = 100
Total No. of Male = 50
Total No. of Female = 50
H: μ = 50


2- Complex Hypothesis:
A complex hypothesis refers to the prediction of relationship between two or more independent variables or two or more dependent variables.
#Example:
Eating vegetables, sleeping more than eight hours a night, and exercising at least thirty minutes five times a week facilitate weight loss.


3- Null Hypothesis:
- A null hypothesis is usually a hypothesis of "no difference." It is denoted as H0.
- The null hypothesis is tested for possible rejection under the assumption that it is true, and always refers to a specified value of the population parameter, such as μ.
#Example:
The population mean is 100
Or
H0: μ = 100


4- Alternate Hypothesis:
An alternate hypothesis is complementary to the null hypothesis. It is denoted by H1.
The form of the alternate hypothesis determines whether a one-tailed or a two-tailed test is employed.
#Example:
For H0: μ = 100, the alternative hypothesis could be:
H1: μ != 100 (two-tailed)
H1: μ > 100 (one-tailed)
H1: μ < 100 (one-tailed)
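
In R, the three forms of the alternate hypothesis map onto the `alternative` argument of `t.test()`. The sketch below uses simulated data (the sample, its size, and the seed are made up for illustration) and tests H0: μ = 100 against each alternative.

```r
set.seed(1)
sample_values <- rnorm(30, mean = 103, sd = 10)   # simulated sample, for illustration only

# H1: mu != 100 (two-tailed)
p_two     <- t.test(sample_values, mu = 100, alternative = "two.sided")$p.value
# H1: mu > 100 (one-tailed, upper)
p_greater <- t.test(sample_values, mu = 100, alternative = "greater")$p.value
# H1: mu < 100 (one-tailed, lower)
p_less    <- t.test(sample_values, mu = 100, alternative = "less")$p.value

print(c(two_sided = p_two, greater = p_greater, less = p_less))
```

The two one-tailed p-values add up to 1, and the two-tailed p-value is twice the smaller of them, which is why the choice of alternative must be made before looking at the data.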


5- Statistical Hypothesis:
A statistical hypothesis is a method of statistical inference performed using data from a scientific study.
#Example:
Given, total number of cities = 10
Mean population (μ) = 75
H0: μ = 75

-------------------------------------------------------------------------------------------------------------------------------

**What Is Data Sampling?**
Data sampling is a statistical analysis technique used to select, manipulate, and analyze a subset of data points to discover hidden patterns and trends in the larger data set.

The sampling theory draws valid inferences about the population parameters on the basis of sample results.

**Chances of Errors in Sampling**
#Consider the following scenarios:
• A quality inspector accepts or rejects hardware components supplied by a vendor, generally on the basis of test results of a random sample.
• A bank accepts or rejects a loan on the basis of a random sample of test results of loan payback with interest and tenure.

..In such cases, statistical decisions are taken on the basis of evidence and provide complete confidence to reduce the chances of error.

**Types of Errors**
The errors in statistical decisions are of two types:
1- Type I Error:
• Reject H0 when it is true
• Probability is denoted by α

2- Type II Error:
• Accept H0 when it is false, that is, when H1 is true
• Probability is denoted by β


..In practice, a Type I error means rejecting a lot when it is good (producer's risk) and a Type II error means accepting a lot when it is bad (consumer's risk).



**Confidence Levels**
The confidence level is the proportion of possible confidence intervals that contain the true value of their corresponding parameter.
Confidence interval for a normal distribution is evaluated for continuous variables like sales in dollars, income of customers, age of customers, score in mathematics, etc.
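
A 95% confidence interval for a mean can be computed by hand and compared with the interval reported by `t.test()`. The sketch below uses simulated sales figures (made up for illustration) and the t distribution, since the population standard deviation is treated as unknown.

```r
set.seed(42)
sales <- rnorm(50, mean = 200, sd = 30)   # hypothetical sales in dollars

n      <- length(sales)
se     <- sd(sales) / sqrt(n)             # standard error of the mean
t_crit <- qt(0.975, df = n - 1)           # critical value for a 95% two-sided interval

ci_manual <- mean(sales) + c(-1, 1) * t_crit * se
ci_ttest  <- t.test(sales, conf.level = 0.95)$conf.int

print(ci_manual)
print(ci_ttest)
```

Raising `conf.level` widens the interval, because a larger share of possible intervals must contain the true mean.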


![WhatsApp%20Image%202023-04-01%20at%2005.07.09.jpeg](attachment:WhatsApp%20Image%202023-04-01%20at%2005.07.09.jpeg)

**Critical Region**
The sampling distribution of a test statistic has two regions: a region of rejection (critical region) and a region of acceptance.
The critical region is the set of values of the test statistic t in the sample space S for which H0 is rejected.


![WhatsApp%20Image%202023-04-01%20at%2005.07.10.jpeg](attachment:WhatsApp%20Image%202023-04-01%20at%2005.07.10.jpeg)

**Decision Making**
The critical region helps in decision making by defining the region of acceptance and the region of rejection. A decision can be correct or incorrect.
#Correct Decision:
If the sample test statistic falls in the rejection region, reject H0; otherwise, do not reject H0.

..In the decision-making approach to hypothesis testing, it is crucial to decide the level of significance before collecting the sample data.

**Level of Significance**
Level of significance refers to the probability of a Type I error (α), that is, the probability that a random value of the statistic t falls in the critical region when H0 is true. It is usually set at 5% or 1% in hypothesis testing.
• If α = 0.05, there is a 5% probability of rejecting H0 when it is true.
• The desired level of significance depends on the amount of risk you are willing to take in rejecting H0 when it is true.
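
To make the link between the significance level and the critical region concrete, the sketch below derives critical values for a standard normal test statistic with `qnorm()`. The helper `reject_h0()` and the example statistics 2.3 and 1.2 are hypothetical, introduced only for illustration.

```r
alpha <- 0.05

# Boundaries of the critical region for a standard normal test statistic
z_crit_two_sided <- qnorm(1 - alpha / 2)   # two-sided test
z_crit_one_sided <- qnorm(1 - alpha)       # one-sided (upper-tail) test

# Hypothetical helper: TRUE when z falls in the two-sided rejection region
reject_h0 <- function(z, alpha = 0.05) {
  abs(z) > qnorm(1 - alpha / 2)
}

reject_h0(2.3)   # statistic beyond the critical value
reject_h0(1.2)   # statistic inside the acceptance region
```

Lowering `alpha` to 0.01 pushes the critical value further out, so stronger evidence is needed before H0 is rejected.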

**Confidence Coefficient**
The confidence coefficient is the complement of the probability of a Type I error (1 − α); multiplying it by 100% gives the confidence level.
It represents the probability of concluding that the value of the parameter specified under H0 is plausible when it is, in fact, true.
It is a measure of the accuracy and repeatability of a statistical test.

**β (Beta) Risk**
β risk is the probability of committing a Type II error and depends on the difference between the hypothesized and actual values of the population parameter.
For a fixed sample size, it is inversely related to α: lowering α raises β.
Beta risk depends on the magnitude of the difference between sample means and is managed by increasing the test sample size.

**Power of Test**
The value (1 − β) is known as the "power" of a statistical test.
It is the complement of the probability of a Type II error and refers to the probability of rejecting H0 when it is false.

**Factors Affecting the Power of Test**
• Population Standard Deviation: Inversely related (power falls as σ increases)
• Sample Size Used: Directly related (power rises as n increases)
• Level of Significance: Directly related (power rises as α increases)
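
These three relationships can be illustrated with `power.t.test()` from base R's stats package. The baseline values (n = 20 per group, true difference 1, standard deviation 2) are assumptions chosen only to show the direction of each effect in a two-sample t-test.

```r
# Baseline power for a two-sample t-test (hypothetical settings)
base <- power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05)$power

# Larger sample size: power goes up
more_n <- power.t.test(n = 80, delta = 1, sd = 2, sig.level = 0.05)$power

# Larger population standard deviation: power goes down
more_sd <- power.t.test(n = 20, delta = 1, sd = 4, sig.level = 0.05)$power

# Larger significance level: power goes up
more_alpha <- power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.10)$power

round(c(base = base, more_n = more_n, more_sd = more_sd, more_alpha = more_alpha), 3)
```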

**What Is Hypothesis Test?**
A hypothesis test is a formal statistical procedure used to decide, on the basis of sample data, whether a hypothesis should be rejected or not.
It is used to generalize conclusions drawn from sample data to a larger population.
..The testing methodology depends on the data used and the reason for the analysis.

**Types of Hypothesis Test**
• Simple Hypothesis Test
• Complex Hypothesis Test
• Null Hypothesis Test
• Alternative Hypothesis Test
• Statistical Hypothesis Test

In addition, tests are classified as:
• Parametric Test
• Non-Parametric Test


**What Is a Parametric Test?**
A parametric statistical test is one that makes assumptions about the parameters (defining properties) of the population distribution(s) from which one's data is drawn.
In these tests, inferences are based on the assumptions made about the nature of the population distribution. The tests are typically used for normally distributed data.

**Types of Parametric Tests**
1- Z-Test and T-Test: Two population means or proportions are compared and tested.
2- Analysis of Variance (ANOVA) Test: Equality of several population means is tested.


###Z-Test:
Z-Test is performed in cases where the test statistic is z, σ is known, the population is normal, and the sample size is at least 30.
The formula to calculate z (the standardized test statistic) is:

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Where,
• n: Sample size
• X̄: Sample mean from a sample X1, X2, ..., Xn
• μ: Population mean
• σ: Population standard deviation

![WhatsApp%20Image%202023-04-01%20at%2005.07.10%20%281%29.jpeg](attachment:WhatsApp%20Image%202023-04-01%20at%2005.07.10%20%281%29.jpeg)

..In R's pnorm(), lower.tail = TRUE is used to find the probability of values no larger than z, whereas lower.tail = FALSE is used to find the probability of values larger than z.
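
A minimal sketch of a one-sample Z-test computed by hand, showing both settings of `lower.tail` in `pnorm()`. The sample mean, the known σ and the sample size are hypothetical numbers; only the hypothesized mean of 75 echoes the earlier example.

```r
x_bar <- 78    # hypothetical sample mean
mu0   <- 75    # hypothesized population mean
sigma <- 10    # population standard deviation, assumed known
n     <- 36    # sample size (at least 30)

# Standardized test statistic
z <- (x_bar - mu0) / (sigma / sqrt(n))

# Probability of values no larger than z, and of values larger than z
p_lower <- pnorm(z, lower.tail = TRUE)
p_upper <- pnorm(z, lower.tail = FALSE)

# Two-sided p-value for H1: mu != mu0
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(z = z, p_lower = p_lower, p_upper = p_upper, p_two_sided = p_two_sided)
```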

###T-Test:
T-Test is performed in cases where the test statistic is t, σ is unknown, the sample standard deviation s is known, and the population is normal.
The formula to calculate t is:

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

Where,
• n: Sample size
• X̄: Sample mean from a sample X1, X2, ..., Xn
• μ: Population mean
• s: Sample standard deviation

![WhatsApp%20Image%202023-04-01%20at%2005.07.10%20%282%29.jpeg](attachment:WhatsApp%20Image%202023-04-01%20at%2005.07.10%20%282%29.jpeg)

..Degrees of freedom refers to the number of values in the final calculation of a test statistic that are free to vary. It is calculated using the formula df = N − 1 (where N is the number of values in a dataset).
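
The t statistic and its degrees of freedom can be cross-checked against R's built-in `t.test()`. The ten city populations below are invented for illustration; the hypothesized mean of 75 follows the earlier H0 example.

```r
# Hypothetical populations of 10 cities (illustrative values only)
x   <- c(72, 78, 81, 69, 77, 80, 74, 76, 79, 73)
mu0 <- 75
n   <- length(x)

# Manual t statistic: (sample mean - hypothesized mean) / (s / sqrt(n))
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Built-in one-sample t-test
res <- t.test(x, mu = mu0)

t_manual
res$statistic   # same t value
res$parameter   # degrees of freedom, n - 1
```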

###ANOVA:
The ANOVA test is used for hypothesis tests that compare the averages of two or more groups.
For example, consider the following statements:
• An environmentalist wants to know if the average amount of pollution varies in several bodies of water.
• A sociologist wants to find out if a person's income varies according to his/her upbringing.

**Types of ANOVA**
• One-Way ANOVA
• Two-Way ANOVA

1- One-Way ANOVA:
• Uses variances to determine if a statistically significant difference exists among several group means or not
• Tests H0: μ1 = μ2 = μ3 = ... = μk (where μ = group mean and k = number of groups)
..For one-way ANOVA, the ratio of the between-group variability to the within-group variability follows an F-distribution when the null hypothesis is true.

#ASSUMPTIONS:
1- All samples are random and independent
2- Each population is normal
3- The factor is a categorical variable
4- The populations have equal standard deviations
5- The result is a numerical variable


**F-Distribution**
F-distribution or the Fisher-Snedecor distribution is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance (ANOVA).
F-Ratio refers to the value derived from two estimates of the variance, as described below:
• Variance between samples (computed from SSbetween): An estimate of σ² equal to n times the variance of the sample means when the sample sizes are the same. When the sizes differ, the variance is weighted to account for the different sample sizes.
• Variance within samples (computed from SSwithin): An estimate of σ² equal to the average of the sample variances. When the sizes differ, the variance within samples is weighted.
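
The following sketch computes the F-ratio from these two variance estimates for equal-sized groups and checks it against R's `anova()` table. The pollution readings and site names are hypothetical, loosely based on the environmentalist example above.

```r
# Hypothetical pollution readings from three bodies of water, 4 readings each
pollution <- data.frame(
  level = c(5.1, 4.8, 5.6, 5.0,
            6.2, 6.0, 6.5, 5.9,
            4.9, 5.2, 5.0, 4.7),
  site  = factor(rep(c("lake", "river", "pond"), each = 4))
)
n <- 4   # observations per group
k <- 3   # number of groups

group_means <- as.numeric(tapply(pollution$level, pollution$site, mean))
group_vars  <- as.numeric(tapply(pollution$level, pollution$site, var))

var_between <- n * var(group_means)   # n times the variance of the sample means
var_within  <- mean(group_vars)       # average of the sample variances
f_manual    <- var_between / var_within

# p-value from the F-distribution with (k - 1, k * (n - 1)) degrees of freedom
p_value <- pf(f_manual, df1 = k - 1, df2 = k * (n - 1), lower.tail = FALSE)

# Cross-check with the built-in one-way ANOVA table
fit <- anova(lm(level ~ site, data = pollution))
f_builtin <- fit[["F value"]][1]

c(f_manual = f_manual, f_builtin = f_builtin, p_value = p_value)
```

The shortcut formulas hold only because every group has the same size; with unequal sizes the weighted versions described above are needed.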


2- Two-Way ANOVA:
Two-way ANOVA refers to a hypothesis test where the classification of data is based on two independent variables.
For example:
A company classifies its sales both by salesman and by region.

#ASSUMPTIONS:
1- Normal distribution of the population from which the sample is drawn
2- Measurement of the dependent variable at a continuous level
3- Categorical independent groups that have the same size
4- Independence of observations
5- Homogeneity of the variance of the population
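
A minimal sketch of a two-way ANOVA with `aov()`, following the sales example: sales amounts classified by salesman and by region. The data frame is simulated and balanced (every salesman-region pair has the same number of observations); all names and values are assumptions.

```r
set.seed(3)

# Balanced hypothetical design: 3 salesmen x 2 regions x 4 observations each
sales <- expand.grid(
  salesman = factor(c("A", "B", "C")),
  region   = factor(c("North", "South")),
  rep      = 1:4
)
sales$amount <- rnorm(nrow(sales), mean = 100, sd = 10)

# Two-way ANOVA with main effects for both classification variables
fit <- aov(amount ~ salesman + region, data = sales)
tab <- summary(fit)[[1]]
tab
```

Each factor gets its own F test in the table; an interaction term (`salesman * region`) could be added if the effect of one factor is expected to depend on the other.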

-------------------------------------------------------------------------------------------------------------------------------
**What Is a Non-Parametric Test?**
A non-parametric test (sometimes called a distribution free test) does not assume anything about the underlying distribution. It is used when the data is not distributed normally.
Strictly speaking, "non-parametric" is something of a null category, since virtually all statistical tests assume one thing or another about the properties of the source population(s).

**Types of Non-Parametric Tests**
• Kruskal-Wallis test (alternative to the one-way ANOVA)
• Mann-Whitney test (alternative to the two-sample t-test)
• Chi-square test
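
The first two tests in this list are available in base R as `kruskal.test()` and `wilcox.test()` (the Mann-Whitney test is also known as the Wilcoxon rank-sum test). The three small groups below are hypothetical values chosen without ties.

```r
# Hypothetical measurements for three independent groups (no tied values)
g1 <- c(12, 15, 11, 19, 14)
g2 <- c(22, 25, 17, 28, 21)
g3 <- c(13, 16, 18, 20, 10)

# Kruskal-Wallis test: rank-based alternative to one-way ANOVA
kw <- kruskal.test(list(g1, g2, g3))

# Mann-Whitney (Wilcoxon rank-sum) test: alternative to the two-sample t-test
mw <- wilcox.test(g1, g2)

kw$p.value
mw$p.value
```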

**What Is Chi-square Test?**
Chi-square test is a nonparametric test used to compare two or more variables for randomly selected data.

#Chi-Square Test:
1- Considers the square of a standard normal variate
2- Evaluates if frequencies observed in different categories vary significantly from the frequencies expected under a specified set of assumptions
3- Determines how well an assumed distribution fits the data
4- Uses contingency tables (in market research, these tables are called cross-tabs)
5- Supports nominal-level measurements

**Types of Chi-square Test**
1- Chi-square test for goodness of fit
2- Chi-square test for independence of two variables

1- Chi-square test for goodness of fit:
It is used to assess how closely a sample matches a population. The Chi-square test statistic (χ²) is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

with k − 1 degrees of freedom,
where O is the observed count, k is the number of categories, and E is the expected count.

![WhatsApp%20Image%202023-04-01%20at%2005.07.10%20%283%29.jpeg](attachment:WhatsApp%20Image%202023-04-01%20at%2005.07.10%20%283%29.jpeg)

Goodness of fit of a statistical model describes how well the model fits a set of observations.

#Goodness of fit test is used to identify the relation between two attributes, as in the cases below:
• Credit worthiness of borrowers based on their age groups and personal loans
• Relation between the performance of salesmen and training received
• Return on a single stock and on stocks of a sector like pharmaceutical or banking
• Category of viewers and impact of a TV campaign
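
A short sketch of the goodness-of-fit statistic, computed by hand and with `chisq.test()`. The observed counts and the assumption of five equally likely categories are hypothetical.

```r
# Hypothetical observed counts in k = 5 categories
observed   <- c(18, 22, 20, 25, 15)
expected_p <- rep(1 / 5, 5)               # assumed distribution: all categories equally likely
expected   <- sum(observed) * expected_p  # expected counts under that assumption

# Manual statistic: sum of (O - E)^2 / E
chi_manual <- sum((observed - expected)^2 / expected)

# Built-in goodness-of-fit test with k - 1 degrees of freedom
gof <- chisq.test(observed, p = expected_p)

chi_manual
gof$statistic
gof$parameter
gof$p.value
```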

2- Chi-square test for independence of two variables:
It is used to check whether the variables are independent of each other or not. The Chi-square test statistic (χ²) is:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

with (r − 1)(c − 1) degrees of freedom,
where O is the observed count in row i and column j, r is the number of rows, c is the number of columns, and E is the corresponding expected count.

![WhatsApp%20Image%202023-04-01%20at%2005.07.11.jpeg](attachment:WhatsApp%20Image%202023-04-01%20at%2005.07.11.jpeg)

Two random variables are called independent if the probability distribution of one variable is not affected by the other.

#Test of independence is suitable for the following situations:
• There is one categorical variable.
• There are two categorical variables, and you will need to determine the relation between them.
• There are cross-tabulations, and relation between two categorical variables needs to be found.
• There are non-quantifiable variables (For example, answers to questions like, do employees in different age groups choose different types of health plans?)

..If p > significance level, H0 is not rejected.
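
As an illustration of the test of independence, the sketch below applies `chisq.test()` to a cross-tabulation of age group against health plan and reproduces the statistic by hand. The counts and category labels are invented for this example.

```r
# Hypothetical cross-tab: age group (rows) by chosen health plan (columns)
plans <- matrix(
  c(30, 20, 10,
    20, 30, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    age_group = c("under_40", "40_plus"),
    plan      = c("basic", "standard", "premium")
  )
)

# Built-in test of independence, (r - 1)(c - 1) degrees of freedom
ind <- chisq.test(plans)

# Manual statistic: expected count = row total * column total / grand total
expected   <- outer(rowSums(plans), colSums(plans)) / sum(plans)
chi_manual <- sum((plans - expected)^2 / expected)

chi_manual
ind$statistic
ind$parameter
ind$p.value < 0.05   # TRUE means H0 (independence) is rejected at the 5% level
```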

**Hypothesis Test around Mean, Variance, and Proportion**
Both parametric and non-parametric hypothesis tests are used to check whether the mean, variance, and proportion of the population have pre-determined values or if the values need to be defined.


>>Hypothesis Tests about Population Means:
Hypothesis tests about population means compare the population mean of interest with a specified value.

#ASSUMPTION:
X1, X2, ..., Xn is a sample of size n from a normal population with mean μ and variance σ².
The sample mean X̄ is normally distributed with mean μ and variance σ²/n, that is, X̄ ~ N(μ, σ²/n).
If n is large, X̄ is approximately normally distributed in the same way, even if the sample is from a non-normal population.
Therefore, for large samples, the standard normal variable corresponding to X̄ is Z (as calculated in the Z-test).
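
The large-sample claim can be checked with a small simulation. The sketch below assumes an exponential population with rate 1 (mean 1, variance 1, clearly non-normal) and a sample size of 50; these choices are illustrative only.

```r
set.seed(11)
n <- 50   # sample size (assumption)

# 5,000 sample means, each from a fresh sample of a non-normal population
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
var(sample_means)    # close to sigma^2 / n = 1 / 50

# A histogram of the sample means looks roughly bell-shaped
hist(sample_means, breaks = 40, main = "Sampling distribution of the mean")
```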

##WHEN POPULATION VARIANCE IS KNOWN
Consider a large random sample of size n, with sample mean X̄.
Test the hypothesis that the sample has been drawn from a population whose mean μ has a specified value μ0, that is:
• H0: μ = μ0
against one of the alternatives:
• H1: μ ≠ μ0
• H1: μ > μ0
• H1: μ < μ0

Under the null hypothesis, Z = (X̄ − μ0)/S.E.(X̄) approximately follows the standard normal distribution.

..When population variance is known, Z test is used.


##WHEN POPULATION VARIANCE IS UNKNOWN
Consider the following hypothesis formation:
• H0: μ = μ0
• H1: μ ≠ μ0

If μ0 falls in the confidence interval, the test result is "fail to reject the null hypothesis"; if not, the result is "reject the null hypothesis".

..When population variance is unknown, T test is used.
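
The confidence-interval decision rule can be sketched with `t.test()`, which reports both the interval and the p-value. The ten observations and the hypothesized mean of 50 are hypothetical.

```r
# Hypothetical sample and hypothesized mean
x   <- c(52, 48, 55, 60, 47, 51, 58, 49, 53, 57)
mu0 <- 50

res <- t.test(x, mu = mu0, conf.level = 0.95)
ci  <- res$conf.int

# Decision rule: is mu0 inside the 95% confidence interval?
inside   <- mu0 >= ci[1] && mu0 <= ci[2]
decision <- if (inside) "fail to reject H0" else "reject H0"

ci
decision
res$p.value   # leads to the same decision when compared with 0.05
```

For a two-sided test, checking whether μ0 lies in the 95% interval and comparing the p-value with 0.05 are equivalent.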


>>Hypothesis Tests about variance:
Variance is the expected squared deviation of a random variable from its mean; it measures how far a set of (random) numbers is spread out from its average value. A hypothesis test about the population variance checks whether this spread equals a specified value.

Hypothesis Tests about Population Variance FORMULA:
Consider the case where data consists of a simple random sample drawn from a normally distributed population. The test statistic for testing hypotheses about a single population variance is calculated as:

$$\chi^2 = \frac{(n-1)\,s^2}{\sigma^2}$$

where n is the sample size, s² is the sample variance, and σ² is the hypothesized population variance. The statistic has n − 1 degrees of freedom.

Chi-square test is used in hypothesis tests of population variance.
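
Base R has no dedicated function for the one-sample variance test, so the sketch below computes the statistic directly and obtains the p-value from `pchisq()`. The measurements and the hypothesized variance of 0.25 are assumptions for illustration.

```r
# Hypothetical measurements, assumed to come from a normal population
x <- c(9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.7, 10.6)
sigma0_sq <- 0.25   # hypothesized population variance
n <- length(x)

# Test statistic: (n - 1) * s^2 / sigma0^2, with n - 1 degrees of freedom
chi_sq <- (n - 1) * var(x) / sigma0_sq

# Upper-tail p-value for H1: sigma^2 > sigma0^2
p_upper <- pchisq(chi_sq, df = n - 1, lower.tail = FALSE)

c(chi_sq = chi_sq, p_upper = p_upper)
```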


>>Hypothesis Tests about Population Proportions:
A population proportion is the ratio of the number of values in a subset S to the number of values in the full set R. Hypothesis tests about population proportions compare this ratio with a specified value.

Hypothesis Tests about Population Proportions FORMULA:
Consider a random sample of size n in which the proportion of members with a certain attribute is p.
You need to test the hypothesis that the proportion P in the population has a specified value P0, that is:
• H0: P = P0
against one of the alternatives:
• H1: P ≠ P0
• H1: P > P0
• H1: P < P0
For a large sample, Z = (p − P0)/S.E.(p) ~ N(0, 1) under H0
Where,
p = X/n = Number of successes in sample/sample size
P0 = Hypothesized proportion of successes in the population
S.E.(p) = √(P0(1 − P0)/n)
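
A minimal sketch of the large-sample proportion test, computed by hand and cross-checked with `prop.test()` (without continuity correction, whose chi-square statistic equals z²). The counts, 58 successes out of 100, and P0 = 0.5 are hypothetical.

```r
x  <- 58     # hypothetical number of successes
n  <- 100    # sample size
p0 <- 0.5    # hypothesized population proportion

p_hat <- x / n
se    <- sqrt(p0 * (1 - p0) / n)   # standard error under H0
z     <- (p_hat - p0) / se

# Two-sided p-value for H1: P != P0
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Built-in equivalent; its X-squared statistic is z squared
prop_res <- prop.test(x, n, p = p0, correct = FALSE)

c(z = z, p_two_sided = p_two_sided)
prop_res$statistic
prop_res$p.value
```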
    
    
**The Classic Anecdote of Beer and Diaper**
The store collected data using barcode scanners at checkout and stored it in a database. Each record lists all the items purchased by one customer, and these records were later analyzed to understand purchasing trends.
The technique used is called "Market Basket Analysis", better known as "Association Rule".

**Association Rule**
An Association rule is a classic data mining technique that finds interesting patterns or relations in a dataset.
The relation between the order of an item and the frequency of its occurrence is known as Interesting Relation.
