<p align="right"><i>Data Analysis for the Social Sciences - Part II - 2022/23</i></p>

# Quantitative Data Analysis

The purpose of this notebook is to demonstrate how to prepare the *National Survey of Sexual Attitudes and Lifestyles, 2010-2012: Teaching Dataset* for analysis in RStudio. Specifically, it shows how to:
* Handle missing values 
* Create new variables
* Label the values of categorical variables

### Importing data

The first step is to import the *Natsal-3* data.

In [103]:
natsal <- read.table("C:/Users/77901764/Dropbox/uws/teaching/dass/datasets/natsal/UKDA-8735-tab/tab/natsal_3_teaching.tab", 
                     header=TRUE, strip.white = TRUE, stringsAsFactors = FALSE,
                     na.strings = c("NA", ""), sep="\t")
head(natsal) # view the first six observations

,sin2,dateyoi,total_wt,psu_scrm,strata,stratagrp,stratagrp2,stratagrp3,dage,rdoby,⋯,netacc,adj_imd_quintile,qimd,qwimd,qsimd,tenure,livehere,gor_l,urindew,urindsc
,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,9110103,2010,0.4605145,1636,763,332,221,166,25,1985,⋯,1,3,-1,2,-1,4,2,12,5,9
2,9110105,2010,1.7889542,1636,763,332,221,166,30,1980,⋯,1,4,-1,4,-1,4,2,12,5,9
3,9110109,2010,1.3815434,1636,763,332,221,166,29,1981,⋯,1,3,-1,2,-1,4,2,12,5,9
4,9110110,2010,0.7390113,1636,763,332,221,166,27,1983,⋯,2,4,-1,4,-1,4,2,12,5,9
5,9110112,2010,1.9239441,1636,763,332,221,166,41,1969,⋯,1,3,-1,2,-1,2,2,12,5,9
6,9110115,2010,3.2954182,1636,763,332,221,166,43,1967,⋯,1,2,-1,1,-1,2,1,12,5,9


Let's get a list of variable names.

In [104]:
names(natsal)

### Missing values

It is important that missing values are clearly identified in datasets; simply leaving a cell blank in a spreadsheet, for example, is not good practice. That is why you will see specific codes used to represent missing values in social surveys.

For instance, consider the `dage1ch` variable in the *Natsal* dataset, which captures the age at which an individual had their first child. Clearly this question is not relevant to people without a child, and it is also plausible that individuals will not want to answer this question for a number of reasons. Therefore we need a consistent and sensible way of identifying individuals that did not provide / have information for this variable.

In [105]:
summary(natsal$dage1ch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -1.0    -1.0    18.0    15.3    25.0    99.0 

Note the values `-1` and `99`: after consulting the [codebook](./codebook/8735_natsal_teaching_codebook_v1.pdf) we see that these values represent "Not applicable" and "Not answered" respectively.

You may wonder why I'm labouring this point. The crux of the matter is this: while we know that `-1` and `99` represent missing values, *R* does not! These values are therefore included in any analysis we perform using this variable, which is why the median age is reported as 18. Watch what happens when we tell *R* how to handle missing values:

In [106]:
natsal$dage1ch[natsal$dage1ch==-1 | natsal$dage1ch==99] <- NA # convert "-1" and "99" to missing

In [107]:
summary(natsal$dage1ch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.00   21.00   24.00   24.88   28.00   40.00    7361 

The median age (like the other summary statistics) is now more accurate, as invalid / missing values are excluded.

**In *R* missing values are recorded as `NA`.**
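When several codes stand for missing data, the same recoding can also be written with `%in%`. The following is a minimal sketch on a made-up vector; `age_first_child` and its values are illustrative and are not taken from *Natsal*.

```r
# Toy vector: -1 = "Not applicable", 99 = "Not answered" (illustrative values only)
age_first_child <- c(-1, 22, 99, 30, -1, 18)

# Codes that should be treated as missing
missing_codes <- c(-1, 99)

# Replace every value that matches one of the missing codes with NA
age_first_child[age_first_child %in% missing_codes] <- NA

sum(is.na(age_first_child))            # how many values are now missing
median(age_first_child, na.rm = TRUE)  # median of the valid values only
```

Note that `summary()` reports the number of `NA`s for you, whereas functions such as `mean()` and `median()` return `NA` unless you set `na.rm = TRUE`.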

### Creating variables

Often we may want to create new variables so that we do not overwrite existing variables. Or we may want to create a new variable that is a derivation of an existing variable e.g., age groups based on specific age values.

#### Creating a copy of an existing variable

In [108]:
natsal$dage1ch_copy <- natsal$dage1ch

In [109]:
summary(natsal$dage1ch_copy)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.00   21.00   24.00   24.88   28.00   40.00    7361 

In [110]:
summary(natsal$dage1ch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.00   21.00   24.00   24.88   28.00   40.00    7361 
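To see why working on a copy is safe, here is a small sketch using a hypothetical data frame `df` (not part of *Natsal*): changing the copy leaves the original variable untouched.

```r
# Hypothetical data frame with one numeric variable
df <- data.frame(x = c(15, NA, 24))

# Create a copy of the variable
df$x_copy <- df$x

# The two variables are identical at this point
same_before <- identical(df$x, df$x_copy)

# Changing the copy does not affect the original
df$x_copy[1] <- 99
df$x[1]       # original value is unchanged
df$x_copy[1]  # only the copy has been altered
```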

#### Create a derived variable

There are several ways to create a categorical variable from a numeric variable. First, the manual way:

In [111]:
natsal$dage1ch_grp <- NA # create a blank variable

natsal$dage1ch_grp[natsal$dage1ch >= 15 & natsal$dage1ch < 20] <- 1 # first age group
natsal$dage1ch_grp[natsal$dage1ch >= 20 & natsal$dage1ch < 25] <- 2 # second age group etc
natsal$dage1ch_grp[natsal$dage1ch >= 25 & natsal$dage1ch < 30] <- 3
natsal$dage1ch_grp[natsal$dage1ch >= 30 & natsal$dage1ch < 35] <- 4
natsal$dage1ch_grp[natsal$dage1ch >= 35 & natsal$dage1ch <= 100] <- 5

In [112]:
natsal$dage1ch_grp <- factor(natsal$dage1ch_grp, levels = c(1,2,3,4,5), labels = c("15-19", "20-24", "25-29", "30-34", "35-40"))

In [113]:
table(natsal$dage1ch_grp)


15-19 20-24 25-29 30-34 35-40 
 1363  2627  2250  1139   422 

And a more efficient, programmatic way:

In [114]:
natsal$dage1ch_grp2 <- cut(natsal$dage1ch, c(15,20,25,30,35,100), right=FALSE, labels=c("15-19", "20-24", "25-29", "30-34", "35-40"))

In [115]:
table(natsal$dage1ch_grp2)


15-19 20-24 25-29 30-34 35-40 
 1363  2627  2250  1139   422 
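It can help to check how `cut()` treats values that sit exactly on a boundary. The sketch below uses an illustrative vector `ages` (not from the dataset) with the same breaks and labels as above. With `right = FALSE` each interval includes its lower bound and excludes its upper bound, so 20 falls into "20-24" rather than "15-19".

```r
# Illustrative ages, including boundary values and one missing value
ages <- c(15, 19, 20, 34, 35, 40, NA)

# Same breaks and labels as used for dage1ch above
breaks <- c(15, 20, 25, 30, 35, 100)
age_grp <- cut(ages, breaks = breaks, right = FALSE,
               labels = c("15-19", "20-24", "25-29", "30-34", "35-40"))

# Compare each age with the group it was assigned to
data.frame(ages, age_grp)

# Missing ages stay missing in the grouped variable
table(age_grp, useNA = "ifany")
```

The final break of 100 means any age from 35 upwards receives the label "35-40"; that label is only accurate here because the oldest valid age in `dage1ch` is 40.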

### Labelling variables

This dataset is in good shape as it has been prepared with teaching in mind. However, one way to improve it is to label the values of the categorical variables.

There are two ways that categorical variables can be labelled:
* Including missing values
* Excluding missing values

Let's say we're interested in the `religimp` variable. The values of this variable are stored as numbers like so:

In [116]:
table(natsal$religimp)


   1    2    3    4    9 
2163 3773 4414 4755   57 

While the [codebook](./codebook/8735_natsal_teaching_codebook_v1.pdf) tells us what categories these numbers represent, it would be more efficient and legible if we attached labels to these numbers in *R*.

#### Including missing values

In [117]:
natsal$religimp_miss <- factor(natsal$religimp, levels = c(1,2,3,4,9),
                               labels = c("Very important", "Fairly important",
                                          "Not very important", "Not important at all",
                                          "Not answered"))

In [118]:
table(natsal$religimp_miss)


      Very important     Fairly important   Not very important 
                2163                 3773                 4414 
Not important at all         Not answered 
                4755                   57 

In general, it is better to present analyses that **exclude** missing values (e.g., "Not answered" or "Not applicable"), while making a note of how many missing values there are in case you want to report this when writing up.

#### Excluding missing values

In [119]:
natsal$religimp <- factor(natsal$religimp, levels = c(1,2,3,4),
                          labels = c("Very important", "Fairly important",
                                     "Not very important", "Not important at all"))

In [120]:
table(natsal$religimp)


      Very important     Fairly important   Not very important 
                2163                 3773                 4414 
Not important at all 
                4755 
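The "Not answered" responses disappear from this table because any value not listed in `levels` is converted to `NA` by `factor()`. A minimal sketch with an illustrative vector `relig` (made-up values) shows this, and how `useNA` lets you keep a note of how many values are missing:

```r
# Illustrative codes; 9 stands for "Not answered"
relig <- c(1, 2, 9, 4, 3, 9)

# Code 9 is not listed in `levels`, so it becomes NA
relig_f <- factor(relig, levels = c(1, 2, 3, 4),
                  labels = c("Very important", "Fairly important",
                             "Not very important", "Not important at all"))

table(relig_f)                   # missing values are left out by default
table(relig_f, useNA = "ifany")  # adds a column counting the NAs
sum(is.na(relig_f))              # number of missing values
```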

### More data cleaning

The following are examples of data cleaning tasks for a selection of the variables contained in the *Natsal* assessment dataset. Feel free to implement / adapt them for your own data cleaning activities.

#### `rsex`

In [121]:
natsal$rsex <- factor(natsal$rsex, levels = c(1,2), labels = c("Male", "Female"))

In [122]:
table(natsal$rsex)


  Male Female 
  6293   8869 

#### `ethnicgrp`

In [123]:
natsal$ethnicgrp <- factor(natsal$ethnicgrp, levels = c(1,2,9), labels = c("White", "Non-white", "Not answered"))

In [124]:
table(natsal$ethnicgrp)


       White    Non-white Not answered 
       13351          317           47 

#### `sexid`

In [125]:
natsal$sexid <- factor(natsal$sexid, levels = c(1,2,3,4,9), labels = c("Heterosexual/straight", "Gay/lesbian", 
                                                                                   "Bisexual", "Other", "Not answered"))

In [126]:
table(natsal$sexid)


Heterosexual/straight           Gay/lesbian              Bisexual 
                14617                   213                   226 
                Other          Not answered 
                   53                    53 

### Health and Disability

#### `health`

In [127]:
natsal$health <- factor(natsal$health, levels = c(1,2,3,4,5,9), labels = c("Very good", "Good", 
                                                                                   "Fair", "Bad", "Very bad", "Not answered"))

In [128]:
table(natsal$health)


   Very good         Good         Fair          Bad     Very bad Not answered 
        6041         6357         2116          522          124            2 

#### `disabil2`

In [129]:
natsal$disabil2 <- factor(natsal$disabil2, levels = c(1,2,3,9), labels = c("None", "Non-limiting", 
                                                                                   "Limiting", "Not answered"))

In [130]:
table(natsal$disabil2)


        None Non-limiting     Limiting Not answered 
       10536         2037         2586            3 

### Alcohol and smoking

#### `drink`

In [131]:
natsal$drink <- factor(natsal$drink, levels = c(1,2,9), labels = c("Yes", "No", "Not answered"))

In [132]:
table(natsal$drink)


         Yes           No Not answered 
       12340         2822            0 

#### `alcohol2`

In [133]:
natsal$alcohol2 <- factor(natsal$alcohol2, levels = c(0,1,2,9), labels = c("None", "Not more than recommended", 
                                                                                       "More than recommended", "Not answered"))

In [134]:
table(natsal$alcohol2)


                     None Not more than recommended     More than recommended 
                     4375                      9210                      1514 
             Not answered 
                       63 

#### `smoking`

In [135]:
natsal$smoking <- factor(natsal$smoking, levels = c(1,2,3,4,9), labels = c("Non-smoker", "Ex-smoker", 
                                                                                       "Light smoker", "Heavy smoker", "Not answered"))

In [136]:
table(natsal$smoking)


  Non-smoker    Ex-smoker Light smoker Heavy smoker Not answered 
        7650         3282         2759         1464            7 

### Drugs

#### `drcannabis`

In [137]:
natsal$drcannabis <- factor(natsal$drcannabis, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [138]:
table(natsal$drcannabis)


            No            Yes Not applicable   Not answered 
          9723           4750            292            397 

#### `drampheta`

In [139]:
natsal$drampheta <- factor(natsal$drampheta, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [140]:
table(natsal$drampheta)


            No            Yes Not applicable   Not answered 
         13142           1331            292            397 

#### `drcocaine`

In [141]:
natsal$drcocaine <- factor(natsal$drcocaine, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [142]:
table(natsal$drcocaine)


            No            Yes Not applicable   Not answered 
         12748           1725            292            397 

#### `drcrack`

In [143]:
natsal$drcrack <- factor(natsal$drcrack, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [144]:
table(natsal$drcrack)


            No            Yes Not applicable   Not answered 
         14300            173            292            397 

#### `drecstasy`

In [145]:
natsal$drecstasy <- factor(natsal$drecstasy, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [146]:
table(natsal$drecstasy)


            No            Yes Not applicable   Not answered 
         13028           1445            292            397 

#### `drnonihero`

In [147]:
natsal$drnonihero <- factor(natsal$drnonihero, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [148]:
table(natsal$drnonihero)


            No            Yes Not applicable   Not answered 
         14337            136            292            397 

#### `dracidlsd`

In [149]:
natsal$dracidlsd <- factor(natsal$dracidlsd, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [150]:
table(natsal$dracidlsd)


            No            Yes Not applicable   Not answered 
         13648            825            292            397 

#### `drcrysmeth`

In [151]:
natsal$drcrysmeth <- factor(natsal$drcrysmeth, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [152]:
table(natsal$drcrysmeth)


            No            Yes Not applicable   Not answered 
         14415             58            292            397 

#### `dramylnit`

In [153]:
natsal$dramylnit <- factor(natsal$dramylnit, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [154]:
table(natsal$dramylnit)


            No            Yes Not applicable   Not answered 
         13786            687            292            397 

#### `drothnonpre`

In [155]:
natsal$drothnonpre <- factor(natsal$drothnonpre, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [156]:
table(natsal$drothnonpre)


            No            Yes Not applicable   Not answered 
         14119            354            292            397 
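The ten drug variables above share the same codes and labels, so the labelling could also be applied in one step rather than cell by cell. The sketch below assumes a small hypothetical data frame `toy` standing in for `natsal`, with just two of the variables and made-up values; it is one possible approach rather than a required one.

```r
# Hypothetical miniature data frame standing in for natsal (made-up values)
toy <- data.frame(drcannabis = c(0, 1, -1, 9),
                  drcocaine  = c(1, 0, 0, 9))

# Names of the variables that share the same coding scheme
drug_vars <- c("drcannabis", "drcocaine")

# Apply the same levels and labels to every listed variable
toy[drug_vars] <- lapply(toy[drug_vars], factor,
                         levels = c(0, 1, -1, 9),
                         labels = c("No", "Yes", "Not applicable", "Not answered"))

table(toy$drcannabis)
table(toy$drcocaine)
```

Run this kind of code only once on the raw numeric variables: applying `factor()` with numeric `levels` to a variable that has already been labelled would turn every value into `NA`.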

#### `inject2`

In [157]:
natsal$inject2 <- factor(natsal$inject2, levels = c(1,2,-1,9), labels = c("Yes", "No", 
                                                                                            "Not applicable", "Not answered"))

In [158]:
table(natsal$inject2)


           Yes             No Not applicable   Not answered 
           118          14381            292            371 

#### `drugsyr2`

In [159]:
natsal$drugsyr2 <- factor(natsal$drugsyr2, levels = c(0,1,2,-1,9), labels = c("No", "Yes, cannabis only",
                                                                                          "Yes, drugs other than cannabis", 
                                                                                          "Not applicable", "Not answered"))

In [160]:
table(natsal$drugsyr2)


                            No             Yes, cannabis only 
                         12497                           1119 
Yes, drugs other than cannabis                 Not applicable 
                           857                            292 
                  Not answered 
                           397 

### Depression

#### `mscore`

In [161]:
natsal$mscore[natsal$mscore==-1 | natsal$mscore==9] <- NA

In [162]:
summary(natsal$mscore)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.9541  2.0000  6.0000     681 

#### `depscr`

In [163]:
natsal$depscr <- factor(natsal$depscr, levels = c(0,1), labels = c("No", "Yes"))

In [164]:
table(natsal$depscr)


   No   Yes 
12821  1660 

### Sexual attraction

#### `attscale`

In [165]:
natsal$attscale <- factor(natsal$attscale, levels = c(1,2,3,4,5,6), labels = c("Opposite sex only", 
                                                                                           "More often opposite sex, and at least once same sex",
                                                                                          "About equally often to opposite sex and same sex",
                                                                                          "More often same sex, and at least once opposite sex",
                                                                                          "Same sex only",
                                                                                          "Never felt sexually attracted to anyone at all"))

In [166]:
table(natsal$attscale)


                                  Opposite sex only 
                                              13405 
More often opposite sex, and at least once same sex 
                                               1186 
   About equally often to opposite sex and same sex 
                                                168 
More often same sex, and at least once opposite sex 
                                                150 
                                      Same sex only 
                                                104 
     Never felt sexually attracted to anyone at all 
                                                114 

#### `expscale`

In [167]:
natsal$expscale <- factor(natsal$expscale, levels = c(1,2,3,4,5,6), labels = c("Opposite sex only", 
                                                                                           "More often opposite sex, and at least once same sex",
                                                                                          "About equally often to opposite sex and same sex",
                                                                                          "More often same sex, and at least once opposite sex",
                                                                                          "Same sex only",
                                                                                          "Never felt sexually attracted to anyone at all"))

In [168]:
table(natsal$expscale)


                                  Opposite sex only 
                                              13218 
More often opposite sex, and at least once same sex 
                                               1296 
   About equally often to opposite sex and same sex 
                                                 82 
More often same sex, and at least once opposite sex 
                                                142 
                                      Same sex only 
                                                 72 
     Never felt sexually attracted to anyone at all 
                                                296 

### Number of partners

#### `hetlife`

In [169]:
natsal$hetlife[natsal$hetlife==-1 | natsal$hetlife>=9995] <- NA

In [170]:
summary(natsal$hetlife)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    2.00    5.00   10.32   10.00 3300.00     604 

#### `samlife`

In [171]:
natsal$samlife[natsal$samlife==-1 | natsal$samlife>=9995] <- NA

In [172]:
summary(natsal$samlife)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
   0.0000    0.0000    0.0000    0.6564    0.0000 1000.0000        56 
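The recoding used for `hetlife` and `samlife` (a "Not applicable" code plus everything at or above a cut-off) can be wrapped in a small function. `recode_missing()` is a hypothetical helper written for this sketch, and `partners` is an illustrative vector rather than real data.

```r
# Hypothetical helper: set the listed codes, and any value at or above
# `upper`, to NA. Values that are already NA are left as they are.
recode_missing <- function(x, codes = -1, upper = Inf) {
  x[x %in% codes | (!is.na(x) & x >= upper)] <- NA
  x
}

# Illustrative values: -1 and anything from 9995 upwards are missing codes
partners <- c(0, 5, -1, 9995, 9999, 12, NA)

partners <- recode_missing(partners, codes = -1, upper = 9995)

summary(partners)
```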

### Marital and Relationship status

#### `marstat`

In [173]:
natsal$marstat <- factor(natsal$marstat, levels = c(1,2,3,4,5,6,7,8,9), labels = c("Single & never married", 
                                                                                           "Married & living with spouse",
                                                                                          "In registered same-sex civil partnership & living with partner",
                                                                                          "Separated but still legally married",
                                                                                          "Divorced",
                                                                                          "Widowed",
                                                                                              "Separated but still legally in same-sex civil partnership",
                                                                                              "Formerly a same-sex civil partner but now legally dissolved",
                                                                                              "Surviving civil partner, partner having died"))

In [174]:
table(natsal$marstat)


                                        Single & never married 
                                                          7348 
                                  Married & living with spouse 
                                                          5298 
In registered same-sex civil partnership & living with partner 
                                                            48 
                           Separated but still legally married 
                                                           445 
                                                      Divorced 
                                                          1424 
                                                       Widowed 
                                                           554 
     Separated but still legally in same-sex civil partnership 
                                                             2 
   Formerly a same-sex civil partner but now legally dissolved 
                                       