<p align="right"><i>Data Analysis for the Social Sciences - Part II - 2022/23</i></p>

# Quantitative Data Analysis

The purpose of this notebook is to demonstrate how to prepare the *National Survey of Sexual Attitudes and Lifestyles, 2010-2012: Teaching Dataset* for analysis in RStudio. Specifically, it demonstrates how to:
* Handle missing values 
* Create new variables
* Label the values of categorical variables

### Importing data

The first step is to import the *Natsal-3* data.

In [8]:
natsal <- read.table("C:/Users/77901764/Dropbox/uws/teaching/dass/datasets/natsal/UKDA-8735-tab/tab/natsal_3_teaching.tab", 
                     header=TRUE, strip.white = TRUE, stringsAsFactors = FALSE,
                     na.strings = c("NA", ""), sep="\t")
head(natsal) # view the first six observations

| |sin2|dateyoi|total_wt|psu_scrm|strata|stratagrp|stratagrp2|stratagrp3|dage|rdoby|⋯|netacc|adj_imd_quintile|qimd|qwimd|qsimd|tenure|livehere|gor_l|urindew|urindsc|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |`<int>`|`<int>`|`<dbl>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|⋯|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|`<int>`|
|1|9110103|2010|0.4605145|1636|763|332|221|166|25|1985|⋯|1|3|-1|2|-1|4|2|12|5|9|
|2|9110105|2010|1.7889542|1636|763|332|221|166|30|1980|⋯|1|4|-1|4|-1|4|2|12|5|9|
|3|9110109|2010|1.3815434|1636|763|332|221|166|29|1981|⋯|1|3|-1|2|-1|4|2|12|5|9|
|4|9110110|2010|0.7390113|1636|763|332|221|166|27|1983|⋯|2|4|-1|4|-1|4|2|12|5|9|
|5|9110112|2010|1.9239441|1636|763|332|221|166|41|1969|⋯|1|3|-1|2|-1|2|2|12|5|9|
|6|9110115|2010|3.2954182|1636|763|332|221|166|43|1967|⋯|1|2|-1|1|-1|2|1|12|5|9|


Let's get a list of variable names.

In [2]:
names(natsal)

### Drop unnecessary variables

We do not need survey methodology variables or other miscellaneous variables in the final dataset.
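
A minimal sketch of how this step could look, using a small made-up data frame rather than the real file. The vector `drop_vars` is an assumption for illustration: `psu_scrm`, `strata` and `stratagrp` appear in the output above, but which variables count as unnecessary depends on your analysis.

```r
# Toy stand-in for the imported data (values are made up)
toy <- data.frame(
  sin2      = c(1, 2, 3),
  psu_scrm  = c(1636, 1636, 1636),
  strata    = c(763, 763, 763),
  stratagrp = c(332, 332, 332),
  dage      = c(25, 30, 29)
)

# Hypothetical list of survey design variables to remove
drop_vars <- c("psu_scrm", "strata", "stratagrp")

# Keep every column whose name is not in drop_vars
toy_clean <- toy[, !(names(toy) %in% drop_vars), drop = FALSE]
names(toy_clean)
```

Applied to the real data, the same pattern would read `natsal_clean <- natsal[, !(names(natsal) %in% drop_vars)]`, which would produce a data frame with the name used in the later sections of this notebook.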

### Missing values

It is important that missing values are clearly identified in datasets: it is not good practice, for example, to simply leave a cell blank in a spreadsheet. That is why you will see specific codes used to represent missing values in social surveys.

For instance, consider the `dage1ch` variable in the *Natsal* dataset, which captures the age at which an individual had their first child. Clearly this question is not relevant to people without a child, and it is also plausible that some individuals will not want to answer it. Therefore we need a consistent and sensible way of identifying the individuals who did not provide, or do not have, information for this variable.

In [14]:
summary(natsal$dage1ch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -1.0    -1.0    18.0    15.3    25.0    99.0 

Note the values `-1` and `99`: after consulting the [codebook](./codebook/8735_natsal_teaching_codebook_v1.pdf) we see that these values represent "Not applicable" and "Not answered" respectively.

You may wonder why I'm labouring this point. The crux of the matter is this: while we know that `-1` and `99` represent missing values, *R* does not! These values are therefore included in any analysis we perform using this variable, which is why the median age above is reported as 18. Watch what happens when we tell *R* to treat them as missing:

In [16]:
natsal$dage1ch[natsal$dage1ch==-1 | natsal$dage1ch==99] <- NA # convert "-1" and "99" to missing

In [18]:
summary(natsal$dage1ch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.00   21.00   24.00   24.88   28.00   40.00    7361 

The median age (and the other summary statistics) is now more accurate, because the invalid values are excluded.

In *R* missing values are recorded as `NA`.
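
As a small illustration of how `NA` behaves, here is a sketch on a made-up vector (`ages` is not part of the dataset). It also shows `%in%` as an alternative way of writing the recode when several codes need converting.

```r
ages <- c(-1, 18, 25, 99, 31)    # made-up values; -1 and 99 are missing codes

ages[ages %in% c(-1, 99)] <- NA  # %in% avoids chaining several == comparisons

is.na(ages)                      # TRUE where a value is missing
sum(is.na(ages))                 # number of missing values
mean(ages)                       # NA: arithmetic on NA returns NA by default
mean(ages, na.rm = TRUE)         # drop missing values before averaging
```

Note that `summary()` sets missing values aside and reports them in a separate `NA's` column, whereas functions such as `mean()` and `median()` need `na.rm = TRUE`.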

### Creating variables

Often we may want to create new variables so that we do not overwrite existing ones. Or we may want to create a new variable that is derived from an existing variable, e.g., age groups based on specific age values.

#### Creating a copy of an existing variable

In [19]:
natsal$dage1ch_copy <- natsal$dage1ch

In [20]:
summary(natsal$dage1ch_copy)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.00   21.00   24.00   24.88   28.00   40.00    7361 

In [21]:
summary(natsal$dage1ch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.00   21.00   24.00   24.88   28.00   40.00    7361 

#### Create a derived variable

In [22]:
natsal$dage1ch_grp <- natsal$dage1ch

In [None]:
natsal$dage1ch_grp[]
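
One possible way to build a grouped variable is `cut()`, sketched here on a made-up vector. The cut points and group labels are assumptions chosen for illustration; they are not taken from the codebook.

```r
first_child_age <- c(16, 19, 24, 30, 40, NA)  # made-up ages, including one missing

# Intervals are closed on the right by default: (-Inf,19], (19,29], (29,Inf]
age_grp <- cut(first_child_age,
               breaks = c(-Inf, 19, 29, Inf),   # illustrative cut points
               labels = c("Under 20", "20-29", "30 and over"))

table(age_grp, useNA = "ifany")  # missing ages stay missing in the grouped variable
```

The same call could be applied to `natsal$dage1ch` and stored in `natsal$dage1ch_grp`.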

### Labelling variables

This dataset is in good shape as it has been prepared with teaching in mind. However, one way it can be improved is by labelling the values of the categorical variables.

There are two ways that categorical variables can be labelled:
* Including missing values
* Excluding missing values

Let's say we're interested in the `religimp` variable. The values of this variable are stored as numbers like so:

In [3]:
table(natsal$religimp)


   1    2    3    4    9 
2163 3773 4414 4755   57 

While the [codebook](./codebook/8735_natsal_teaching_codebook_v1.pdf) tells us what categories these numbers represent, it would be more efficient and legible if we attached labels to these numbers in *R*.

#### Including missing values

In [9]:
natsal$religimp_miss <- factor(natsal$religimp, levels = c(1,2,3,4,9),
                               labels = c("Very important", "Fairly important",
                                          "Not very important", "Not important at all",
                                          "Not answered"))

In [10]:
table(natsal$religimp_miss)


      Very important     Fairly important   Not very important 
                2163                 3773                 4414 
Not important at all         Not answered 
                4755                   57 

In general it is better to present analyses that **exclude** missing values (e.g., "Not answered" or "Not applicable"), while making a note of how many missing values there are in case you want to report this when writing up.

#### Excluding missing values

In [12]:
natsal$religimp <- factor(natsal$religimp, levels = c(1,2,3,4),
                          labels = c("Very important", "Fairly important",
                                     "Not very important", "Not important at all"))

In [13]:
table(natsal$religimp)


      Very important     Fairly important   Not very important 
                2163                 3773                 4414 
Not important at all 
                4755 
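
Any code that is not listed in `levels` becomes `NA` when the factor is created, which is why "Not answered" disappears from the table. A short sketch on made-up codes (the vector `relig` is hypothetical) shows how to keep track of how many values were set to missing:

```r
relig <- c(1, 2, 2, 3, 4, 9, 9)  # made-up codes; 9 = "Not answered"

relig_f <- factor(relig, levels = c(1, 2, 3, 4),
                  labels = c("Very important", "Fairly important",
                             "Not very important", "Not important at all"))

table(relig_f)                   # missing values are left out
table(relig_f, useNA = "ifany")  # adds an <NA> column with their count
sum(is.na(relig_f))              # number of missing values
```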

#### `agrp`

In [None]:
natsal_clean$agrp <- factor(natsal_clean$agrp, levels = c(1,2,3,4,5,6), labels = c("16-24", "25-34", "35-44", 
                                                                                   "45-54", "55-64", "65-74"))

In [None]:
table(natsal_clean$agrp)

#### `rsex`

In [None]:
natsal_clean$rsex <- factor(natsal_clean$rsex, levels = c(1,2), labels = c("Male", "Female"))

In [None]:
table(natsal_clean$rsex)

#### `ethnicgrp`

In [None]:
natsal_clean$ethnicgrp <- factor(natsal_clean$ethnicgrp, levels = c(1,2,9), labels = c("White", "Non-white", "Not answered"))

In [None]:
table(natsal_clean$ethnicgrp)

#### `sexid`

In [None]:
natsal_clean$sexid <- factor(natsal_clean$sexid, levels = c(1,2,3,4,9), labels = c("Heterosexual/straight", "Gay/lesbian", 
                                                                                   "Bisexual", "Other", "Not answered"))

In [None]:
table(natsal_clean$sexid)

### Health and Disability

#### `health`

In [None]:
natsal_clean$health <- factor(natsal_clean$health, levels = c(1,2,3,4,5,9), labels = c("Very good", "Good", 
                                                                                   "Fair", "Bad", "Very bad", "Not answered"))

In [None]:
table(natsal_clean$health)

#### `disabil2`

In [None]:
natsal_clean$disabil2 <- factor(natsal_clean$disabil2, levels = c(1,2,3,9), labels = c("None", "Non-limiting", 
                                                                                   "Limiting", "Not answered"))

In [None]:
table(natsal_clean$disabil2)

### Alcohol and smoking

#### `drink`

In [None]:
natsal_clean$drink <- factor(natsal_clean$drink, levels = c(1,2,9), labels = c("Yes", "No", "Not answered"))

In [None]:
table(natsal_clean$drink)

#### `alcohol2`

In [None]:
natsal_clean$alcohol2 <- factor(natsal_clean$alcohol2, levels = c(0,1,2,9), labels = c("None", "Not more than recommended", 
                                                                                       "More than recommended", "Not answered"))

In [None]:
table(natsal_clean$alcohol2)

#### `smoking`

In [None]:
natsal_clean$smoking <- factor(natsal_clean$smoking, levels = c(1,2,3,4,9), labels = c("Non-smoker", "Ex-smoker", 
                                                                                       "Light smoker", "Heavy smoker", "Not answered"))

In [None]:
table(natsal_clean$smoking)

### Drugs

#### `drcannabis`

In [None]:
natsal_clean$drcannabis <- factor(natsal_clean$drcannabis, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drcannabis)

#### `drampheta`

In [None]:
natsal_clean$drampheta <- factor(natsal_clean$drampheta, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drampheta)

#### `drcocaine`

In [None]:
natsal_clean$drcocaine <- factor(natsal_clean$drcocaine, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drcocaine)

#### `drcrack`

In [None]:
natsal_clean$drcrack <- factor(natsal_clean$drcrack, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drcrack)

#### `drecstasy`

In [None]:
natsal_clean$drecstasy <- factor(natsal_clean$drecstasy, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drecstasy)

#### `drnonihero`

In [None]:
natsal_clean$drnonihero <- factor(natsal_clean$drnonihero, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drnonihero)

#### `dracidlsd`

In [None]:
natsal_clean$dracidlsd <- factor(natsal_clean$dracidlsd, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$dracidlsd)

#### `drcrysmeth`

In [None]:
natsal_clean$drcrysmeth <- factor(natsal_clean$drcrysmeth, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drcrysmeth)

#### `dramylnit`

In [None]:
natsal_clean$dramylnit <- factor(natsal_clean$dramylnit, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$dramylnit)

#### `drothnonpre`

In [None]:
natsal_clean$drothnonpre <- factor(natsal_clean$drothnonpre, levels = c(0,1,-1,9), labels = c("No", "Yes", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drothnonpre)
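
The ten drug variables above share the same codes and labels, so they could also be labelled in one pass rather than one cell at a time. The following is a sketch on a made-up two-column data frame; `toy` and `drug_vars` are names introduced here for illustration.

```r
# Toy stand-in with two of the drug variables (values are made up)
toy <- data.frame(
  drcannabis = c(0, 1, -1, 9),
  drcocaine  = c(1, 0, 0, -1)
)

# Names of the columns that share the same coding
drug_vars <- c("drcannabis", "drcocaine")

# Apply the same factor() call to each listed column
toy[drug_vars] <- lapply(toy[drug_vars], factor,
                         levels = c(0, 1, -1, 9),
                         labels = c("No", "Yes", "Not applicable", "Not answered"))

table(toy$drcannabis)
```

With the real data, `drug_vars` would list all ten variable names and `toy` would be replaced by `natsal_clean`.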

#### `inject2`

In [None]:
natsal_clean$inject2 <- factor(natsal_clean$inject2, levels = c(1,2,-1,9), labels = c("Yes", "No", 
                                                                                            "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$inject2)

#### `drugsyr2`

In [None]:
natsal_clean$drugsyr2 <- factor(natsal_clean$drugsyr2, levels = c(0,1,2,-1,9), labels = c("No", "Yes, cannabis only",
                                                                                          "Yes, drugs other than cannabis", 
                                                                                          "Not applicable", "Not answered"))

In [None]:
table(natsal_clean$drugsyr2)

### Depression

#### `mscore`

In [11]:
natsal_clean$mscore[natsal_clean$mscore==-1 | natsal_clean$mscore==9] <- NA

In [12]:
summary(natsal_clean$mscore)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.9541  2.0000  6.0000     681 

#### `depscr`

In [13]:
# assumes 0 = "No" and 1 = "Yes", as for the other 0/1 variables; check the codebook
natsal_clean$depscr <- factor(natsal_clean$depscr, levels = c(0,1), labels = c("No", "Yes"))

In [14]:
table(natsal_clean$depscr)


   No   Yes 
12821  1660 

### Sexual attraction

#### `attscale`

In [15]:
natsal_clean$attscale <- factor(natsal_clean$attscale, levels = c(1,2,3,4,5,6), labels = c("Opposite sex only", 
                                                                                           "More often opposite sex, and at least once same sex",
                                                                                          "About equally often to opposite sex and same sex",
                                                                                          "More often same sex, and at least once opposite sex",
                                                                                          "Same sex only",
                                                                                          "Never felt sexually attracted to anyone at all"))

In [16]:
table(natsal_clean$attscale)


                                  Opposite sex only 
                                              13405 
More often opposite sex, and at least once same sex 
                                               1186 
   About equally often to opposite sex and same sex 
                                                168 
More often same sex, and at least once opposite sex 
                                                150 
                                      Same sex only 
                                                104 
     Never felt sexually attracted to anyone at all 
                                                114 

#### `expscale`

In [17]:
natsal_clean$expscale <- factor(natsal_clean$expscale, levels = c(1,2,3,4,5,6), labels = c("Opposite sex only", 
                                                                                           "More often opposite sex, and at least once same sex",
                                                                                          "About equally often to opposite sex and same sex",
                                                                                          "More often same sex, and at least once opposite sex",
                                                                                          "Same sex only",
                                                                                          "Never felt sexually attracted to anyone at all"))

In [18]:
table(natsal_clean$expscale)


                                  Opposite sex only 
                                              13218 
More often opposite sex, and at least once same sex 
                                               1296 
   About equally often to opposite sex and same sex 
                                                 82 
More often same sex, and at least once opposite sex 
                                                142 
                                      Same sex only 
                                                 72 
     Never felt sexually attracted to anyone at all 
                                                296 

### Number of partners

#### `hetlife`

In [21]:
natsal_clean$hetlife[natsal_clean$hetlife==-1 | natsal_clean$hetlife>=9995] <- NA

In [22]:
summary(natsal_clean$hetlife)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    2.00    5.00   10.32   10.00 3300.00     604 

#### `samlife`

In [23]:
natsal_clean$samlife[natsal_clean$samlife==-1 | natsal_clean$samlife>=9995] <- NA

In [24]:
summary(natsal_clean$samlife)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
   0.0000    0.0000    0.0000    0.6564    0.0000 1000.0000        56 

### Marital and Relationship status

#### `marstat`

In [25]:
natsal_clean$marstat <- factor(natsal_clean$marstat, levels = c(1,2,3,4,5,6,7,8,9),
                               labels = c("Single & never married",
                                          "Married & living with spouse",
                                          "In registered same-sex civil partnership & living with partner",
                                          "Separated but still legally married",
                                          "Divorced",
                                          "Widowed",
                                          "Separated but still legally in same-sex civil partnership",
                                          "Formerly a same-sex civil partner but now legally dissolved",
                                          "Surviving civil partner, partner having died"))

In [26]:
table(natsal_clean$marstat)


                                        Single & never married 
                                                          7348 
                                  Married & living with spouse 
                                                          5298 
In registered same-sex civil partnership & living with partner 
                                                            48 
                           Separated but still legally married 
                                                           445 
                                                      Divorced 
                                                          1424 
                                                       Widowed 
                                                           554 
     Separated but still legally in same-sex civil partnership 
                                                             2 
   Formerly a same-sex civil partner but now legally dissolved 
                                       