<h1> R programming </h1>

<h2> Data Preparation </h2>

<h3>Step 1: Defining the Problem Statement</h3>

<h4>Overview:</h4>
<p>Defining the problem statement is the foundational step in data preparation and analysis. This involves understanding the objective, key questions, and the scope of the analysis.</p>

<h4>How to Define a Problem Statement:</h4>
<ul>
  <li><b>Understand the Business Context:</b> Engage with stakeholders, ask questions about the problem and decisions needed.</li>
  <li><b>Identify Key Objectives:</b> Determine primary and secondary objectives (e.g., identify products with high sales velocity).</li>
  <li><b>Formulate the Problem Statement:</b> Convert business objectives into a precise, actionable problem statement.</li>
  <li><b>Scope the Data Required:</b> Decide on the necessary data (e.g., product IDs, sales, customer demographics).</li>
  <li><b>Set Success Criteria:</b> Define how success will be measured (e.g., accurate demand forecasting).</li>
</ul>

<h3>Step 2: Data Collection</h3>

<h4>Importing Data into R:</h4>
<p>Use R to load the collected data:</p>
<ul>
  <li><b>CSV Files:</b> Use functions like <code>read.csv()</code> or <code>read.table()</code>.</li>
  <li><b>Text Files:</b> Use <code>read.delim()</code> for tab-delimited data.</li>
  <li><b>JSON Files:</b> Use the <code>rjson</code> package to read and convert JSON files.</li>
  <li><b>Excel Files:</b> Use the <code>readxl</code> package to handle Excel files.</li>
</ul>
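
<p>As a rough illustration of the readers listed above, the sketch below writes a tiny made-up data frame to temporary files and reads it back. The data, the <code>tempfile()</code> paths, and the Excel path in the last comment are placeholders rather than files from this tutorial, and the JSON step only runs if the <code>rjson</code> package happens to be installed.</p>

```r
# Sketch: round-trip a tiny, made-up data frame through CSV and tab-delimited files.
sample_df <- data.frame(id = 1:3, title = c("A", "B", "C"))

csv_path <- tempfile(fileext = ".csv")   # placeholder paths, not tutorial files
tsv_path <- tempfile(fileext = ".txt")
write.csv(sample_df, csv_path, row.names = FALSE)
write.table(sample_df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

from_csv <- read.csv(csv_path)     # comma-separated values
from_tsv <- read.delim(tsv_path)   # tab-delimited text

# JSON needs an add-on package, so this step is guarded.
if (requireNamespace("rjson", quietly = TRUE)) {
  json_path <- tempfile(fileext = ".json")
  writeLines('{"id": [1, 2, 3], "title": ["A", "B", "C"]}', json_path)
  from_json <- as.data.frame(rjson::fromJSON(file = json_path))
}

# Excel (hypothetical path): readxl::read_excel("path/to/file.xlsx")
head(from_csv)
```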


In [None]:
# Import CSV file (this is a local file path, not a web URL)
csv_path <- "/content/imdb_top_1000.csv"
data_csv <- read.csv(csv_path)
# Display the first few rows of the data
head(data_csv)

Unnamed: 0_level_0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>
1,"https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UX67_CR0,0,67,98_AL_.jpg",The Shawshank Redemption,1994,A,142 min,Drama,9.3,"Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.",80,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
2,"https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY98_CR1,0,67,98_AL_.jpg",The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.,100,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
3,"https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg",The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,"When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",84,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
4,"https://m.media-amazon.com/images/M/MV5BMWMwMGQzZTItY2JlNC00OWZiLWIyMDctNDk2ZDQ2YjRjMWQ0XkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY98_CR1,0,67,98_AL_.jpg",The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,"The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.",90,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
5,"https://m.media-amazon.com/images/M/MV5BMWU4N2FjNzYtNTVkNC00NzQ0LTg0MjAtYTJlMjFhNGUxZDFmXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg",12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.,96,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
6,"https://m.media-amazon.com/images/M/MV5BNzA5ZDNlZWMtM2NhNS00NDJjLTk4NDItYTRmY2EwMWZlMTY3XkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UX67_CR0,0,67,98_AL_.jpg",The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men against Sauron's army to draw his gaze from Frodo and Sam as they approach Mount Doom with the One Ring.,94,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905


In [None]:
# mtcars is a data set that ships with R
print(mtcars)
?mtcars # Use the question mark to get information about the data set

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0   

<h3>Step 3: Data Preprocessing</h3>

In [None]:
# Use dim() to find the dimension of the data set
dim(data_csv)

# Use names() to find the names of the variables from the data set
names(data_csv)

# Use rownames() to get the row names (row labels) of the data set
rownames(data_csv)
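
<p>The same inspection functions can be tried on the built-in <code>mtcars</code> data set, which needs no external file. This is only a small sketch; the comments describe what these calls return for <code>mtcars</code>.</p>

```r
# mtcars ships with R, so this runs without any external file.
shape <- dim(mtcars)                # c(rows, columns): 32 rows and 11 columns
first_vars <- names(mtcars)[1:3]    # "mpg" "cyl" "disp"
first_rows <- head(rownames(mtcars), 2)  # "Mazda RX4" "Mazda RX4 Wag"

print(shape)
print(first_vars)
print(first_rows)
```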

In [None]:
# View the structure of the data
str(data_csv)

'data.frame':	1000 obs. of  16 variables:
 $ Poster_Link  : chr  "https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ "https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg" "https://m.media-amazon.com/images/M/MV5BMWMwMGQzZTItY2JlNC00OWZiLWIyMDctNDk2ZDQ2YjRjMWQ0XkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ ...
 $ Series_Title : chr  "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
 $ Released_Year: chr  "1994" "1972" "2008" "1974" ...
 $ Certificate  : chr  "A" "A" "UA" "A" ...
 $ Runtime      : chr  "142 min" "175 min" "152 min" "202 min" ...
 $ Genre        : chr  "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
 $ IMDB_Rating  : num  9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...


In [None]:
# Get a summary of the data
summary(data_csv)

 Poster_Link        Series_Title       Released_Year      Certificate       
 Length:1000        Length:1000        Length:1000        Length:1000       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   Runtime             Genre            IMDB_Rating      Overview        
 Length:1000        Length:1000        Min.   :7.600   Length:1000       
 Class :character   Class :character   1st Qu.:7.700   Class :character  
 Mode  :character   Mode  :character   Median :7.900   Mode  :character  
                                       Mean   :7.949                     
              

In [None]:
# Use which.max() and which.min() to find the index (row position) of the max and min value
which.max(data_csv$No_of_Votes)
which.min(data_csv$No_of_Votes)
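
<p>The index returned by <code>which.max()</code> or <code>which.min()</code> is usually most useful for pulling out the matching row. A minimal sketch on the built-in <code>mtcars</code> data set (used here only as a stand-in) might look like this. When several rows tie, both functions return the first matching position.</p>

```r
# Use the index to retrieve the whole row, not just its position.
top_idx <- which.max(mtcars$mpg)   # position of the highest mpg
low_idx <- which.min(mtcars$mpg)   # position of the lowest mpg (first one on ties)

best_car  <- rownames(mtcars)[top_idx]
worst_car <- rownames(mtcars)[low_idx]

mtcars[top_idx, c("mpg", "cyl", "hp")]
mtcars[low_idx, c("mpg", "cyl", "hp")]
```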

In [None]:
# mean
mean(data_csv$No_of_Votes)

# median
median(data_csv$No_of_Votes)

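# mode
# Note: base R's mode() returns the storage type (e.g. "numeric"), not the
# most frequent value, so build the statistical mode from a frequency table
votes_freq <- table(data_csv$No_of_Votes)
as.numeric(names(votes_freq)[votes_freq == max(votes_freq)])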

In [None]:
# quantile() function
quantile(data_csv$No_of_Votes)
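
<p>By default <code>quantile()</code> returns the minimum, the three quartiles, and the maximum. The <code>probs</code> argument selects other cut points. The sketch below uses a made-up vector of vote counts purely for illustration.</p>

```r
# Made-up vote counts, for illustration only.
votes <- c(120, 450, 800, 1500, 2300, 5200, 91000)

five_num <- quantile(votes)                        # 0%, 25%, 50%, 75%, 100%
tails    <- quantile(votes, probs = c(0.10, 0.90)) # 10th and 90th percentiles
middle   <- quantile(votes, probs = 0.5, names = FALSE)  # same value as median(votes)

print(five_num)
print(tails)
```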

In [None]:
# Check for missing values
missing_val <- is.na(data_csv)
colSums(missing_val)
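
<p>Counts of missing values are often easier to interpret next to the share of missing values and the number of fully complete rows. The sketch below uses the built-in <code>airquality</code> data set as a stand-in, because it is known to contain missing values.</p>

```r
# airquality ships with R and contains NAs in Ozone and Solar.R.
na_counts <- colSums(is.na(airquality))    # missing values per column
na_share  <- colMeans(is.na(airquality))   # fraction of missing values per column
complete_rows <- sum(complete.cases(airquality))  # rows with no NA at all

print(na_counts)
print(round(na_share, 3))
print(complete_rows)
```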

<h2> Data Preprocessing using dplyr library </h2>

In [None]:
# Installing dplyr library
install.packages("dplyr")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# Importing the library
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [None]:
# Import CSV file using read.csv()
data <- read.csv("both_sexes.csv", header = TRUE)

# Display the first few rows of the data
head(data)

Unnamed: 0_level_0,X,year,date,all_2534,HS_2534,SC_2534,BAp_2534,BAo_2534,GD_2534,White_2534,⋯,kids_SC_2534,kids_BAp_2534,kids_BAo_2534,kids_GD_2534,nokids_poor_2534,nokids_mid_2534,nokids_rich_2534,kids_poor_2534,kids_mid_2534,kids_rich_2534
Unnamed: 0_level_1,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,1960,1960-01-01,0.1233145,0.1095332,0.1522818,0.2389952,0.2389952,,0.1164848,⋯,0.001150824,0.0005751073,0.0005751073,,0.4933061,0.410008,0.4921184,0.008722711,0.0007532065,0.0008027331
2,2,1970,1970-01-01,0.1269715,0.1094,0.1495096,0.2187031,0.2187031,,0.1179043,⋯,0.003699982,0.0014683425,0.0014683425,,0.5097742,0.3764538,0.4288948,0.029974945,0.0033771145,0.0030435661
3,3,1980,1980-01-01,0.1991767,0.1617313,0.2236916,0.2881646,0.2881646,,0.1824126,⋯,0.018135401,0.0062544364,0.0062544364,,0.5740402,0.399825,0.3848089,0.077926214,0.0102368871,0.0068317224
4,4,1990,1990-01-01,0.2968306,0.2777491,0.2780912,0.3612968,0.3656655,0.3474505,0.2639256,⋯,0.052032702,0.0171241042,0.0181766027,0.01374234,0.6546908,0.5186604,0.4750156,0.170763774,0.0274655254,0.0182329127
5,5,2000,2000-01-01,0.3450087,0.3316545,0.3249205,0.3874906,0.3939579,0.369174,0.3127149,⋯,0.09762531,0.0370024452,0.0401009875,0.02761467,0.7055451,0.5690228,0.4458023,0.256281918,0.0597845173,0.0295644698
6,6,2001,2001-01-01,0.3527767,0.3446069,0.3341101,0.3835686,0.3925148,0.3590304,0.3183506,⋯,0.110030662,0.0399801447,0.0445838012,0.02645041,0.7147334,0.5864741,0.4461111,0.280146488,0.0677954572,0.0336540502


In [None]:
# Inspecting the data
glimpse(data)
summary(data)

Rows: 17
Columns: 75
$ X                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ year             <int> 1960, 1970, 1980, 1990, 2000, 2001, 2002, 2003, 2004,…
$ date             <chr> "1960-01-01", "1970-01-01", "1980-01-01", "1990-01-01…
$ all_2534         <dbl> 0.1233145, 0.1269715, 0.1991767, 0.2968306, 0.3450087…
$ HS_2534          <dbl> 0.1095332, 0.1094000, 0.1617313, 0.2777491, 0.3316545…
$ SC_2534          <dbl> 0.1522818, 0.1495096, 0.2236916, 0.2780912, 0.3249205…
$ BAp_2534         <dbl> 0.2389952, 0.2187031, 0.2881646, 0.3612968, 0.3874906…
$ BAo_2534         <dbl> 0.2389952, 0.2187031, 0.2881646, 0.3656655, 0.3939579…
$ GD_2534          <dbl> NA, NA, NA, 0.3474505, 0.3691740, 0.3590304, 0.351284…
$ White_2534       <dbl> 0.1164848, 0.1179043, 0.1824126, 0.2639256, 

       X           year          date              all_2534     
 Min.   : 1   Min.   :1960   Length:17          Min.   :0.1233  
 1st Qu.: 5   1st Qu.:2000   Class :character   1st Qu.:0.3450  
 Median : 9   Median :2004   Mode  :character   Median :0.3673  
 Mean   : 9   Mean   :1999                      Mean   :0.3587  
 3rd Qu.:13   3rd Qu.:2008                      3rd Qu.:0.4394  
 Max.   :17   Max.   :2012                      Max.   :0.4943  
                                                                
    HS_2534          SC_2534          BAp_2534         BAo_2534     
 Min.   :0.1094   Min.   :0.1495   Min.   :0.2187   Min.   :0.2187  
 1st Qu.:0.3317   1st Qu.:0.3249   1st Qu.:0.3774   1st Qu.:0.3871  
 Median :0.3708   Median :0.3451   Median :0.3875   Median :0.4000  
 Mean   :0.3617   Mean   :0.3481   Mean   :0.3843   Mean   :0.3968  
 3rd Qu.:0.4599   3rd Qu.:0.4235   3rd Qu.:0.4298   3rd Qu.:0.4474  
 Max.   :0.5235   Max.   :0.4799   Max.   :0.4766   Max.   :0.5023

<h3>Handling Missing Values</h3>

In [None]:
# Count missing values in each column
data %>% summarise(across(everything(), ~ sum(is.na(.))))

X,year,date,all_2534,HS_2534,SC_2534,BAp_2534,BAo_2534,GD_2534,White_2534,⋯,kids_SC_2534,kids_BAp_2534,kids_BAo_2534,kids_GD_2534,nokids_poor_2534,nokids_mid_2534,nokids_rich_2534,kids_poor_2534,kids_mid_2534,kids_rich_2534
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,0,0,0,3,0,⋯,0,0,0,3,0,0,0,0,0,0


In [None]:
install.packages("tidyr")
library(tidyr)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# Remove the rows with any missing values
data_clean <- data %>% drop_na()

<p><code>mutate()</code> creates new columns that are functions of existing variables. It can also modify columns (if the name is the same as an existing column) and delete columns (by setting their value to <code>NULL</code>).</p>
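
<p>A minimal sketch of the three behaviours of <code>mutate()</code>, using a small made-up data frame (the column names <code>math</code>, <code>art</code>, and <code>total</code> are illustrative only). It assumes <code>dplyr</code> is installed.</p>

```r
library(dplyr)

# Made-up scores, for illustration only.
scores <- data.frame(
  name = c("a", "b", "c"),
  math = c(70, 85, 90),
  art  = c(60, 75, 95)
)

result <- scores %>%
  mutate(
    total = math + art,   # create: a new column built from existing ones
    math  = math / 100,   # modify: same name as an existing column
    art   = NULL          # delete: setting a column to NULL drops it
  )

print(result)
```

<p>Arguments are evaluated in order, so <code>total</code> is computed from the original <code>math</code> and <code>art</code> values before they are changed or dropped.</p>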

In [None]:
# Replacing missing values with column mean
data_clean <- data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm= TRUE), .)))

In [None]:
head(data_clean)

Unnamed: 0_level_0,X,year,date,all_2534,HS_2534,SC_2534,BAp_2534,BAo_2534,GD_2534,White_2534,⋯,kids_SC_2534,kids_BAp_2534,kids_BAo_2534,kids_GD_2534,nokids_poor_2534,nokids_mid_2534,nokids_rich_2534,kids_poor_2534,kids_mid_2534,kids_rich_2534
Unnamed: 0_level_1,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,1960,1960-01-01,0.1233145,0.1095332,0.1522818,0.2389952,0.2389952,0.3746516,0.1164848,⋯,0.001150824,0.0005751073,0.0005751073,0.03369062,0.4933061,0.410008,0.4921184,0.008722711,0.0007532065,0.0008027331
2,2,1970,1970-01-01,0.1269715,0.1094,0.1495096,0.2187031,0.2187031,0.3746516,0.1179043,⋯,0.003699982,0.0014683425,0.0014683425,0.03369062,0.5097742,0.3764538,0.4288948,0.029974945,0.0033771145,0.0030435661
3,3,1980,1980-01-01,0.1991767,0.1617313,0.2236916,0.2881646,0.2881646,0.3746516,0.1824126,⋯,0.018135401,0.0062544364,0.0062544364,0.03369062,0.5740402,0.399825,0.3848089,0.077926214,0.0102368871,0.0068317224
4,4,1990,1990-01-01,0.2968306,0.2777491,0.2780912,0.3612968,0.3656655,0.3474505,0.2639256,⋯,0.052032702,0.0171241042,0.0181766027,0.01374234,0.6546908,0.5186604,0.4750156,0.170763774,0.0274655254,0.0182329127
5,5,2000,2000-01-01,0.3450087,0.3316545,0.3249205,0.3874906,0.3939579,0.369174,0.3127149,⋯,0.09762531,0.0370024452,0.0401009875,0.02761467,0.7055451,0.5690228,0.4458023,0.256281918,0.0597845173,0.0295644698
6,6,2001,2001-01-01,0.3527767,0.3446069,0.3341101,0.3835686,0.3925148,0.3590304,0.3183506,⋯,0.110030662,0.0399801447,0.0445838012,0.02645041,0.7147334,0.5864741,0.4461111,0.280146488,0.0677954572,0.0336540502


<h3> Removing Outliers with dplyr </h3>
<p><strong>Z-Score Method:</strong> This method computes the Z-score of each data point, that is, its distance from the mean measured in standard deviations. Any point whose absolute Z-score exceeds a threshold (commonly 3) is treated as an outlier.</p>
<pre><code>remove_outliers_z <- function(df, column, threshold = 3) {
  # scale() returns a one-column matrix, so convert it to a plain numeric vector
  z_scores <- as.numeric(scale(df[[column]]))
  df %>% filter(abs(z_scores) < threshold)
}</code></pre>

<p><strong>Percentile Method:</strong> This method removes data points outside a specified percentile range, such as below the 1st percentile or above the 99th percentile.</p>
<pre><code>remove_outliers_percentile <- function(df, column, lower_pct = 0.01, upper_pct = 0.99) {
  lower_bound <- quantile(df[[column]], lower_pct)
  upper_bound <- quantile(df[[column]], upper_pct)

  df %>% filter(df[[column]] >= lower_bound & df[[column]] <= upper_bound)
}</code></pre>

<p><strong>IQR Method:</strong> This method removes data points that fall outside the range of Q1 - 1.5 * IQR to Q3 + 1.5 * IQR.</p>
<pre><code>remove_outliers_iqr <- function(df, column) {
  Q1 <- quantile(df[[column]], 0.25)
  Q3 <- quantile(df[[column]], 0.75)
  iqr <- Q3 - Q1

  df %>% filter(df[[column]] >= (Q1 - 1.5 * iqr) & df[[column]] <= (Q3 + 1.5 * iqr))
}</code></pre>
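
<p>To make the IQR rule concrete, the worked sketch below applies it by hand to a small made-up vector using base R only. The numbers are illustrative and the quartiles use R's default <code>quantile()</code> definition.</p>

```r
# Made-up values with one obvious extreme point.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q1  <- quantile(x, 0.25, names = FALSE)   # 3.25
q3  <- quantile(x, 0.75, names = FALSE)   # 7.75
iqr <- q3 - q1                            # 4.5

lower <- q1 - 1.5 * iqr                   # -3.5
upper <- q3 + 1.5 * iqr                   # 14.5

kept <- x[x >= lower & x <= upper]        # 100 falls outside and is dropped

# For comparison: the Z-score of 100 in this sample
z_of_100 <- (100 - mean(x)) / sd(x)
print(kept)
print(z_of_100)
```

<p>In this small sample the Z-score of 100 is about 2.84, so a Z-score filter with a threshold of 3 would keep it, while the IQR rule removes it. Extreme values inflate the standard deviation, which is one reason the two methods can disagree.</p>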

In [None]:
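# Function to remove outliers using the IQR method
remove_outliers <- function(df, column){
  Q1 <- quantile(df[[column]], 0.25, na.rm = TRUE)
  Q3 <- quantile(df[[column]], 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1  # lower-case name avoids shadowing stats::IQR()

  # Keep rows within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
  df %>% filter(df[[column]] >= (Q1 - 1.5 * iqr) & df[[column]] <= (Q3 + 1.5 * iqr))
}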

# Removing outliers
data <- remove_outliers(data, "all_2534")

In [None]:
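# Min-Max Normalization using dplyr
# na.rm = TRUE keeps missing values from turning a whole column into NA.
# min() and max() warn ("no non-missing arguments") when a column has no
# non-missing values, for example when `data` has zero rows.
data_normalized <- data %>%
  mutate(across(where(is.numeric), ~ (.x - min(.x, na.rm = TRUE)) / (max(.x, na.rm = TRUE) - min(.x, na.rm = TRUE))))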

[1m[22m[36mℹ[39m In argument: `across(where(is.numeric), ~(. - min(.))/(max(.) - min(.)))`.
[33m![39m no non-missing arguments to min; returning Inf


In [None]:
# Install and load the necessary package
install.packages("caret")
library(caret)

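# One-Hot Encoding for categorical columns
# Every categorical column needs at least 2 distinct levels; otherwise
# the encoding fails with a "contrasts" error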
data_encoded <- dummyVars(" ~ .", data = data)
data_transformed <- as.data.frame(predict(data_encoded, newdata = data))

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’


Loading required package: ggplot2

Loading required package: lattice



ERROR: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
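
<p>The error message above says that contrasts need a factor with at least two levels, so it is worth checking the number of levels before encoding. As an alternative illustration, the sketch below one-hot encodes a single factor with base R's <code>model.matrix()</code> instead of <code>caret</code>. The small <code>movies</code> data frame is made up for this example.</p>

```r
# Made-up data, for illustration only.
movies <- data.frame(
  title       = c("m1", "m2", "m3", "m4"),
  certificate = factor(c("A", "UA", "U", "A")),
  rating      = c(9.3, 9.0, 8.9, 8.8)
)

# Encoding only makes sense when the factor has at least two levels.
stopifnot(nlevels(movies$certificate) >= 2)

# "- 1" drops the intercept so that every level gets its own 0/1 column.
onehot  <- model.matrix(~ certificate - 1, data = movies)
encoded <- cbind(movies[c("title", "rating")], as.data.frame(onehot))

print(encoded)
```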
