In [2]:
# Use HTML and CSS (via the IRdisplay library) to set the notebook's look and feel.
library("IRdisplay")

ChangeDisplaySettings <- TRUE
if (ChangeDisplaySettings) {

    # This command will change the size of R plots. Adjust width and height to suit.
    options(repr.plot.width=8,repr.plot.height=6)

    # The following changes font size and colour for notebook pages

    display_html("
        <style>
            body {background-color: grey;
                  color: black;
                  font-family: Calibri, sans-serif;
                  font-size: 100%;
            }
            h1 {color: black;
                font-size: 200%;
            }
            h2 {color: black;
                font-size: 150%;
            }
            h3 {color: black;
                font-size: 100%;
            }
            p {padding: 10px 0px 10px;
               text-align: justify;
            }
            li {line-height: 100%;
                padding-top: 1%;
                padding-bottom: 1%;
                text-align: justify;
            }
            strong {font-weight: bold;
            }
            /* types of lists */
            ul.nobull {
                 list-style-type: none;
            }
            ol.i {
                list-style-type: lower-roman;
            }
            ol.a {
                list-style-type: lower-alpha;
            }
            ol.A {
                list-style-type: upper-alpha;
            }
            /* this is to make text justified in paragraphs and lists*/
            .text_cell_render p {
                text-align: justify;
                text-justify: inter-word;
            }
            .text_cell_render li {
                text-align: justify;
                text-justify: inter-word;
            }
            .rendered_html table, .rendered_html td, .rendered_html th {font-size: 100%;
            }
            .container {
                width: 80% !important;
            }
        </style>
    ")
}

<html>
<head>
    <h2>University of Stirling</h2>
    <h2>Computing Science and Mathematics</h2>
    <h2>MATPMD1 Statistics for Data Science</h2>
    <h1>Chapter 2 Data</h1>
</head>

<body>
    <p>Observations or data are the raw materials with which statisticians work. 
    </p>
    <p>For most statistical methods to be applicable, these observations must be numbers or be convertible into numbers (we'll cover that process later).
    </p>
</body>

<body>
    <p>Examples of data are:</p>
    <ul>
        <li>crop yield;</li>
        <li>time to recovery after cardiac surgery;</li>
        <li>number of defective items in an assembly process;</li>
        <li>birth rate in Scotland;</li>
        <li>market share of different washing powders;</li>
        <li>political preferences.</li>
    </ul>
</body>

<body>
    <p>When confronted with data we initially ask many questions such as:</p>
    <ul>
        <li>Why were these data collected?</li>
        <li>Who collected these data?</li>
        <li>What methods did they use to collect the data?</li>
        <li>Has randomisation been used?</li>
        <li>Are there any sources of bias?</li>
    </ul>
</body>

<body>
    <ul>
        <li>What is the target population for which inference is desired?</li>
        <li>Is an informal / descriptive / exploratory analysis sufficient?</li>
        <li>Do we have all the relevant data?</li>
        <li>Has some been concealed from us?</li>
        <li>Has the investigator who collected the data removed any that didn't quite fit in?</li>
    </ul>
</body>

<body>
    <p>We now classify the different types of data. This is an important step in any analysis, because the type of analysis to be performed usually depends on the type of data being considered.
    </p>
    <p>All formal techniques have assumptions of various kinds to do with the type and structure of the data. Hence in order to use a technique we must check that the assumptions are &#39;reasonably&#39; valid.
    </p>
</body>

<body>
    <h2>2.1 Data Types</h2>
    <p>We say that our data consist of observations, which are values of variables. Examples of variables include:
    </p>
    <table>
    <thead>
      <tr>
        <th>Variable</th>
        <th>Value</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Number of births in a country</td>
        <td>0, 1, 2, 3, ...</td>
      </tr>
      <tr>
        <td>Geographical region</td>
        <td>North Scotland, West Scotland,...</td>
      </tr>
      <tr>
        <td>Weekly earnings</td>
        <td>From £80 upwards</td>
      </tr>
      <tr>
        <td>Dose of chemical</td>
        <td>0 upwards</td>
      </tr>
      <tr>
        <td>Smoking behaviour</td>
        <td>Non smoker, Ex smoker, Occasional smoker, Heavy smoker</td>
      </tr>
    </tbody>
    </table>
</body>

<body>
    <p>There are two main types of data:
    </p>
    <ul>
        <li>Categorical</li>
        <li>Quantitative</li>
    </ul>
</body>    
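
<body>
    <p>As a minimal sketch of how the two main types might appear in R, the cell below stores categorical variables as factors and quantitative variables as numeric vectors. The data frame <code>survey</code>, its column names and all of its values are hypothetical and invented purely for illustration.
    </p>
</body>

```r
# Hypothetical observations, invented for illustration only
survey <- data.frame(
    region          = factor(c("North Scotland", "West Scotland", "North Scotland")),
    smoking         = factor(c("Non smoker", "Heavy smoker", "Ex smoker")),
    weekly_earnings = c(250.50, 410.00, 325.75),
    dose            = c(0, 1.5, 3)
)

# TRUE for quantitative (numeric) columns, FALSE for categorical (factor) columns
is_quantitative <- sapply(survey, is.numeric)
print(is_quantitative)

# str() shows how R has stored each variable
str(survey)
```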

<body>
    <h2>2.2 Categorical Variables</h2>
    <p>Examples of Categorical Variables are:
    </p>
    <table>
    <thead>
      <tr>
        <th>Categorical Variable</th>
        <th>Value</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Smoking Behaviour</td>
        <td>Non smoker, Ex smoker, Occasional smoker, Heavy smoker</td>
      </tr>
      <tr>
        <td>Scottish Electoral Registration Area</td>
        <td>Borders, Lothian, Central Scotland,...</td>
      </tr>
      <tr>
        <td>Hair Colour</td>
        <td>Brown, Blonde, ...</td>
      </tr>
      <tr>
        <td>Eye Colour</td>
        <td>Blue, Green, Brown, ...</td>
      </tr>
    </tbody>
    </table>
</body>

<body>
    <p>There are two types of Categorical Variable:
    </p>
    <ul>
        <li>Ordinal</li>
        <li>Nominal</li>
    </ul>
</body>

<body>
    <h3>2.2.1 Ordinal Categorical Variables
    </h3>
    <p>With an <strong>Ordinal</strong> variable, the categories can be placed in a meaningful order. For example:
    </p>
    <table>
    <thead>
      <tr>
        <th>Categorical Variable</th>
        <th>Value</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Smoking</td>
        <td>Non smoker, Ex smoker, Occasional smoker, Heavy smoker</td>
      </tr>
      <tr>
        <td>Experience of Statistical packages</td>
        <td>None, A little, Some, A lot</td>
      </tr>
      <tr>
        <td>Strength of agreement with a statement</td>
        <td>Disagree strongly, Disagree mildly, Neutral, Agree mildly, Agree strongly</td>
      </tr>
    </tbody>
    </table>
</body>
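
<body>
    <p>One way to represent an ordinal variable in R is an ordered factor. The sketch below uses the "Experience of Statistical packages" categories from the table above. The individual responses are hypothetical.
    </p>
</body>

```r
# Categories listed from lowest to highest
experience_levels <- c("None", "A little", "Some", "A lot")

# Hypothetical responses, invented for illustration only
experience <- factor(c("Some", "None", "A lot", "A little", "Some"),
                     levels = experience_levels, ordered = TRUE)
print(experience)

# Comparisons respect the declared order of the categories
more_than_a_little <- experience > "A little"
print(more_than_a_little)

# The highest category observed
print(max(experience))

# Counts are reported in category order, not alphabetical order
print(table(experience))
```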

<body>
    <h3>2.2.2 Nominal Categorical Variables
    </h3>
    <p>With <strong>Nominal</strong> variables, the categories are simply labels; that is, one category cannot be ranked as greater than or less than another. For example:
    </p>
    <ul>
        <li>male and female are different categories, but we cannot say that one is greater than or better than the other;
        </li>
        <li>for eye colour, we cannot say that green eyes are greater than brown eyes.
        </li>
    </ul>
</body>

<body>
    <p>The categories are often conveniently designated by numbers, but these numbers do not imply any order or size relationship between the categories.
    </p>
</body>
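
<body>
    <p>The point about numeric labels can be illustrated with an unordered factor in R. This is a sketch with hypothetical eye colour observations: R stores each category as an integer code internally, but it refuses to treat those codes as an ordering.
    </p>
</body>

```r
# Hypothetical observations, invented for illustration only
eye_colour <- factor(c("Blue", "Green", "Brown", "Blue"))

# By default the levels are sorted alphabetically: Blue, Brown, Green
print(levels(eye_colour))

# The underlying integer codes are labels only
codes <- as.integer(eye_colour)
print(codes)

# Asking whether green is "greater than" brown is not meaningful for a
# nominal variable: R returns NA (with a warning, suppressed here)
comparison <- suppressWarnings(eye_colour[2] > eye_colour[3])
print(comparison)

# Testing for equality is still meaningful
print(eye_colour[1] == eye_colour[4])
```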

<body>
    <h2>2.3 Quantitative Variables</h2>
    <p>For quantitative variables we want to quantify something by counting or measuring. For example:
    </p>
    <table>
    <thead>
      <tr>
        <th>Quantitative Variable</th>
        <th>Value</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Number of live births</td>
        <td>1, 2, 3, ...</td>
      </tr>
      <tr>
        <td>Weekly earnings</td>
        <td>£120+</td>
      </tr>
      <tr>
        <td>Weight</td>
        <td>40kg+</td>
      </tr>
      <tr>
        <td>Expenditure</td>
        <td>£25,000</td>
      </tr>
    </tbody>
    </table>
</body>

<body>
    <p><strong>Note</strong> There is a rather subtle distinction between two types of quantitative variable:
    </p>
    <ul>
        <li>interval - the scale has equal intervals, so addition and subtraction are meaningful;</li>
        <li>ratio - additionally, the scale starts at a true zero.</li>
    </ul>
    <p>However, this is just a distinction to be aware of; we shall not make use of it.
    </p>
</body>
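
<body>
    <p>Temperature is a standard illustration of the interval/ratio distinction, although it is not one of the examples used in this chapter, so treat the sketch below as an assumed example. Celsius is an interval scale (its zero is arbitrary), whereas Kelvin is a ratio scale (its zero is a true zero). Differences agree on both scales, but ratios do not.
    </p>
</body>

```r
# Two temperatures in degrees Celsius (illustrative values)
celsius <- c(10, 20)

# The same temperatures in Kelvin
kelvin <- celsius + 273.15

# Differences are meaningful on both scales and agree
print(diff(celsius))
print(diff(kelvin))

# The Celsius ratio suggests "twice as hot", but the zero is arbitrary
ratio_celsius <- celsius[2] / celsius[1]
print(ratio_celsius)

# The Kelvin ratio is the physically meaningful one
ratio_kelvin <- kelvin[2] / kelvin[1]
print(ratio_kelvin)
```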

<body>
    <h3>2.3.1 Discrete and Continuous variables
    </h3>
    <p>The two major sub-groupings of quantitative data are:</p>
    <ul>
        <li>Discrete</li>
        <li>Continuous</li>
    </ul>
</body>

<body>
    <p><strong>Discrete</strong> variables are any kind of count. They take values in a restricted set, such as the integers {0, 1, 2, ...}. For example: the number of defective items.
    </p>
    <p><strong>Continuous</strong> variables are any kind of measurement and may take any real value within a range. For example: height, weight and age.
    </p>
</body>
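
<body>
    <p>In R, counts are naturally stored as integers and measurements as doubles. The sketch below uses hypothetical values for the number of defective items and for height. A frequency table suits the discrete variable, while a numerical summary suits the continuous one.
    </p>
</body>

```r
# Hypothetical data, invented for illustration only
defective_items <- c(0L, 2L, 1L, 0L, 3L, 1L)      # counts (discrete)
height_cm       <- c(172.4, 165.0, 180.25, 158.9) # measurements (continuous)

# Storage types used by R
print(is.integer(defective_items))
print(is.double(height_cm))

# A discrete variable takes few distinct values, so a frequency table is useful
defect_counts <- table(defective_items)
print(defect_counts)

# A continuous variable is better described by a numerical summary
print(summary(height_cm))
```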

<body>
    <h2>2.4 Summary of Data types
    </h2>
    <p>In summary:
    </p>
    <img src="MATPMD1Chapter2Fig1.jpg" alt="Summary of data types" style="width:75%">
</body>
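
<body>
    <p>The classification in this chapter can be loosely mirrored by how a variable is stored in R. The helper <code>data_type()</code> below is hypothetical and only a rough heuristic: it assumes that ordinal variables are stored as ordered factors, nominal variables as unordered factors or character strings, counts as integers and measurements as doubles. Real data sets often break these assumptions, so the storage type should never replace thinking about what the variable represents.
    </p>
</body>

```r
# Hypothetical helper: guess the data type of a vector from its R storage class
data_type <- function(x) {
    if (is.ordered(x)) {
        "Categorical: Ordinal"
    } else if (is.factor(x) || is.character(x)) {
        "Categorical: Nominal"
    } else if (is.integer(x)) {
        "Quantitative: Discrete"
    } else if (is.numeric(x)) {
        "Quantitative: Continuous"
    } else {
        "Unknown"
    }
}

# Illustrative calls with invented values
print(data_type(factor(c("None", "Some"), levels = c("None", "Some"), ordered = TRUE)))
print(data_type(factor(c("Blue", "Brown"))))
print(data_type(c(0L, 3L, 1L)))
print(data_type(c(61.5, 70.2)))
```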