# Univariate analysis

In this assignment I will start data analysis for the research project described here:

[Research proposal](https://github.com/CourseraParticipant/Data-Management-and-Visualization/blob/master/Research%20proposal-checkpoint.ipynb)

The univariate analysis will be done in a few steps, so the Python code is split into blocks. Let me start by loading the NESARC data set.

In [1]:
# Step 1: Loading data which is saved in the working directory as csv file under name "nesarc.csv"
%matplotlib inline  
# specific for jupyter notebook: for graphs to be displayed inline
# Importing libraries
import pandas
import numpy
# load data set
data = pandas.read_csv("nesarc.csv",low_memory = False)
# Checking how many observations the dataset contains
print("Total number of rows in the data source")
print(len(data))
# Checking how many variables/columns are tracked in the data
print("Total number of variables present in the original data")
print(len(data.columns))

Total number of rows in the data source
43093
Total number of variables present in the original data
3008
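
Since only a handful of the 3008 columns are needed later, the load step could also be restricted to those columns. The following is a minimal sketch under assumptions: it reads a tiny made-up CSV from memory instead of "nesarc.csv", and the mixed-case headers are hypothetical. The `usecols` parameter of `pandas.read_csv` accepts a callable that is evaluated against each column name.

```python
import io
import pandas

# Tiny stand-in for nesarc.csv (made-up values and hypothetical mixed-case headers)
csv_text = "idnum,s1q1f,S1Q8A,s1q9b,S13Q1,other\n1,1,1,2,0,x\n2,2,2, ,99,y\n"

# The four variables used in this analysis
wanted = {"S1Q1F", "S1Q8A", "S1Q9B", "S13Q1"}

# Keep a column only if its uppercase name is one of the wanted variables
small = pandas.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: name.upper() in wanted,
    low_memory=False,
)
small.columns = small.columns.str.upper()

print(small.shape)
print(list(small.columns))
```

With the real file, `io.StringIO(csv_text)` would be replaced by the file name; whether this saves noticeable time or memory depends on the machine.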


The next step is to extract a subset from the original data that includes only the observations relevant for my research question. Namely, according to my research focus, I am interested only in subjects born in the U.S.A. As the codebook states, the variable "S1Q1F" records whether or not a subject was born in the United States.
Before extracting the subset of data, let us look at the distribution of this variable.

In [2]:
# Step 2: Extracting only needed data
# First, since Python is case-sensitive, I change the names of all columns to uppercase
# (.str.upper() avoids assigning a bare map object, which may not be accepted by every pandas version)
data.columns = data.columns.str.upper()
# 1st univariate analysis: How many subjects in data are born in the U.S.?
print("Counts of U.S. born / non-born data subjects: 1 - born in the U.S., 2 - not born in the U.S., 9 - Unknown")
c1 = data['S1Q1F'].value_counts(sort = False,dropna = False)
print(c1)
# Now the same counts but in terms of frequencies/percentages
print("Frequency of U.S. born / non-born data subject: 1 - born in the U.S.,2 - not born in the U.S., 9 - Unknown")
p1 = data['S1Q1F'].value_counts(sort = False,dropna = False, normalize = True)
print(p1)

Counts of U.S. born / non-born data subjects: 1 - born in the U.S., 2 - not born in the U.S., 9 - Unknown
1    35622
2     7320
9      151
Name: S1Q1F, dtype: int64
Frequency of U.S. born / non-born data subject: 1 - born in the U.S.,2 - not born in the U.S., 9 - Unknown
1    0.826631
2    0.169865
9    0.003504
Name: S1Q1F, dtype: float64


**Summary for the first frequency check - U.S. born subjects**: Here we see that 35622 or ca. 82.7% of the study subjects were born in the U.S. and 7320 subjects were not, whereas for 151 individuals the data contain no information about place of birth.
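
The counts and the percentages can also be shown side by side in one table, which makes summaries like the one above easier to read off. This is a minimal sketch, assuming a small made-up sample instead of the NESARC data; `freq_table` is a hypothetical helper name, not part of pandas.

```python
import pandas

def freq_table(series):
    """Return counts and percentages of each value in one DataFrame."""
    counts = series.value_counts(sort=False, dropna=False)
    percent = series.value_counts(sort=False, dropna=False, normalize=True) * 100
    return pandas.DataFrame({"count": counts, "percent": percent.round(1)})

# Made-up sample, not NESARC data: 1 = born in the U.S., 2 = not born in the U.S., 9 = unknown
sample = pandas.Series([1, 1, 1, 2, 9, 1, 2, 1], name="S1Q1F")
table = freq_table(sample)
print(table)
```

On the real data the call would be `freq_table(data['S1Q1F'])`, and the same helper could be reused for the other three variables.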
 
Since I aim to investigate only U.S. born subjects, I may conclude that my subset will still be sufficiently large (it contains 35622 subjects or 82.7% of the provided data).
Let me now create my data set "myData" by selecting only subjects born in the U.S.

In [4]:
# Making my own data containing only subjects who were born in the U.S.
sub1 = data[data['S1Q1F']==1]
myData = sub1.copy()
# Double-check: I want to verify that myData really contains the target group. I will check the
# number of observations in myData and compare it with the counts obtained above. According to the counts, there should be
# 35622 rows in the subset
print ('Double-check for the subset of data: Total number of rows in my selected data')
print (len(myData))

Double-check for the subset of data: Total number of rows in my selected data
35622
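
The same double-check can be written as assertions, so that a mismatch stops the notebook instead of relying on a visual comparison. A minimal sketch on a made-up sample; the names `data_demo` and `subset_demo` are my own and only stand in for `data` and `myData`.

```python
import pandas

# Made-up sample, not NESARC data
data_demo = pandas.DataFrame({"S1Q1F": [1, 2, 1, 9, 1], "S1Q8A": [1, 1, 2, 2, 1]})

# Number of U.S. born rows according to the frequency check
expected_rows = (data_demo["S1Q1F"] == 1).sum()

# Same selection as in the cell above, written with .loc
subset_demo = data_demo.loc[data_demo["S1Q1F"] == 1].copy()

# Both assertions fail loudly if the subset does not match the frequency check
assert len(subset_demo) == expected_rows
assert (subset_demo["S1Q1F"] == 1).all()
print(len(subset_demo))
```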


 So, the dataset myData now contains only rows I am interested in. As my research question is about the potential relationship between employment status, occupation type and hospital stays, these three variables will be analyzed.
 
Let me start with the employment status in the last 12 months, which is represented by the variable 'S1Q8A'.

In [7]:
# Step 3 : Univariate analysis on the subset
# The 1st variable - Employment status in the last 12 months, represented by 'S1Q8A' with codes: 1 - Yes, 2 - No
# Counts of employed/unemployed subjects (in last 12 months)
print('Counts for employed/unemployed subjects who were born in the U.S.: 1 - employed, 2 - unemployed')
c2=myData['S1Q8A'].value_counts(sort = False, dropna = False)
print(c2)
print('Associated percentages for employed/unemployed subjects who were born in the U.S.: 1- employed, 2-unemployed')
p2=myData['S1Q8A'].value_counts(sort = False, dropna = False, normalize = True)
print(p2)

Counts for employed/unemployed subjects who were born in the U.S.: 1 - employed, 2 - unemployed
1    25301
2    10321
Name: S1Q8A, dtype: int64
Associated percentages for employed/unemployed subjects who were born in the U.S.: 1- employed, 2-unemployed
1    0.710263
2    0.289737
Name: S1Q8A, dtype: float64


**Summary for the second frequency check - Employment status in the last 12 months of U.S. born subjects**: Here we see that 25301 out of 35622, or approximately 71% of the U.S. born subjects, were employed in the last 12 months, whereas 10321 U.S. born subjects or ca. 29% were not employed in the same time period.
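
To avoid repeating the code legend in every print statement, the numeric codes could be mapped to readable labels before counting. This is a minimal sketch on a made-up sample; the dictionary `employment_labels` is my own naming, built from the codes 1 - Yes and 2 - No given above.

```python
import pandas

# Labels for the codes of 'S1Q8A' as described above
employment_labels = {1: "employed", 2: "not employed"}

# Made-up sample, not NESARC data
sample = pandas.Series([1, 2, 1, 1, 2, 1], name="S1Q8A")

# Replace each code by its label, then compute the shares
labelled = sample.map(employment_labels)
share = labelled.value_counts(normalize=True).round(3)
print(share)
```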
 
Let me now look at the second variable of interest: the occupation type, which is given by the variable 'S1Q9B'.

In [9]:
# The second variable - Occupation type represented by 'S1Q9B' 
# Counts of occupation types
print('Counts for occupation types of  subjects who were born in the U.S.:')
c3=myData['S1Q9B'].value_counts(sort = True, dropna = False)
print(c3)
# I set sort = True because the codes themselves have no meaning as numbers. With sort = True, I can see which
# occupation type occurs most frequently, which least frequently, etc.
print('Associated percentages for occupation categories of subjects who were born in the U.S.: ')
#frequencies of the occupations
p3=myData['S1Q9B'].value_counts(sort = True, dropna = False, normalize = True)
print(p3)
print ('Codes for the occupation type:')
print ('1 - Executive, Administrative, and Managerial')
print('2 - Professional Speciality')
print('3 - Technical and Related Support')
print('4 - Sales')
print('5 - Administrative Support, including Clerical')
print(' 6 - Private Household')
print ('7 - Protective Services')
print ('8 - Other Services')
print(' 9 - Farming, Forestry and Fishing')
print('10 - Precision Production, Craft and Repair')
print('11 - Operators, Fabricators and Laborers')
print('12 - Transportation and Material Moving')
print('13 - Handlers, Equipment Cleaners and Laborers')
print('14 - Military')
print ('BL - NA, never worked for pay or in family business or farm')

Counts for occupation types of  subjects who were born in the U.S.:
      7308
2     5167
1     4200
8     4097
5     2902
4     2883
3     2506
11    2137
10    1059
13     886
12     839
9      525
7      492
6      325
14     296
Name: S1Q9B, dtype: int64
Associated percentages for occupation categories of subjects who were born in the U.S.: 
      0.205154
2     0.145051
1     0.117905
8     0.115013
5     0.081467
4     0.080933
3     0.070350
11    0.059991
10    0.029729
13    0.024872
12    0.023553
9     0.014738
7     0.013812
6     0.009124
14    0.008309
Name: S1Q9B, dtype: float64
Codes for the occupation type:
1 - Executive, Administrative, and Managerial
2 - Professional Speciality
3 - Technical and Related Support
4 - Sales
5 - Administrative Support, including Clerical
 6 - Private Household
7 - Protective Services
8 - Other Services
 9 - Farming, Forestry and Fishing
10 - Precision Production, Craft and Repair
11 - Operators, Fabricators and Laborers
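12 - Transportation and Material Moving
13 - Handlers, Equipment Cleaners and Laborers
14 - Military
BL - NA, never worked for pay or in family business or farm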

**Summary for the third frequency check - Occupation type of U.S. born subjects**: Here we see that the largest single group (7308 persons or ca. 20.5% of U.S. born subjects) has a blank occupation code, meaning that the question was not applicable, the data are missing, or the subject never worked for pay or in a family business or farm. It is interesting to compare this number with the number of subjects not employed in the last 12 months, which was 10321. Unless subjects answered the employment status and occupation type questions inconsistently, at most 7308 of those 10321 subjects can have a blank occupation code. Hence at least 10321 - 7308 = 3013 persons were not employed in the last 12 months but worked in, or are trained for, a specific occupation.
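
The bound above uses only the two marginal counts. A cross-tabulation of employment status against "occupation given / blank" would show the actual overlap. The sketch below runs on a made-up sample and rests on an assumption: blank occupation codes are stored as whitespace strings, as the blank row label in the output suggests.

```python
import pandas

# Made-up sample, not NESARC data
demo = pandas.DataFrame({
    "S1Q8A": [1, 1, 2, 2, 2, 1, 2],                   # 1 = employed, 2 = not employed
    "S1Q9B": ["2", "8", " ", "5", " ", " ", "1"],     # " " = blank occupation code
})

# True where an occupation code is present, False where it is blank
has_occupation = demo["S1Q9B"].str.strip() != ""

# Rows: employment status, columns: occupation blank / given
table = pandas.crosstab(demo["S1Q8A"], has_occupation)
table = table.rename(columns={False: "occupation blank", True: "occupation given"})
print(table)

# Lower bound from the marginal counts reported above
not_employed, blank_occupation = 10321, 7308
lower_bound = max(0, not_employed - blank_occupation)
print(lower_bound)  # 3013
```

On the real subset, the cell in row 2 and column "occupation given" would replace the lower bound by the exact number.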
 
The second largest category is "Professional Speciality" (5167 persons or ca. 14.5%), followed by "Executive, Administrative, and Managerial" (4200 subjects or 11.8%) and then "Other Services" (4097 people or 11.5%). The least common occupation type among the subjects, according to the data, is "Military" with only 296 subjects. The second smallest category is "Private Household" with 325 persons.
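
Reading category names off a separate legend is error-prone, so the codes could be replaced by their labels directly in the frequency table. This is a minimal sketch under assumptions: the sample is made up, the codes are assumed to be stored as strings with whitespace for blank, and the dictionary is shortened to a few of the codes listed above.

```python
import pandas

# Shortened label dictionary; the remaining codes from the legend would be added the same way
occupation_labels = {
    "1": "Executive, Administrative, and Managerial",
    "2": "Professional Speciality",
    "14": "Military",
    "BL": "NA, never worked for pay or in family business or farm",
}

# Made-up sample, not NESARC data
sample = pandas.Series(["2", "14", " ", "2", "1"], name="S1Q9B")

# Turn whitespace-only entries into the code "BL", then map codes to labels
cleaned = sample.str.strip().replace("", "BL")
labelled = cleaned.map(occupation_labels)

counts = labelled.value_counts()
print(counts)
```

Codes missing from the dictionary become NaN after `map`, which makes an incomplete legend easy to spot.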
 
Let me now look at the last variable of interest: the hospital stays, which are given by the variable 'S13Q1'.

In [11]:
# The third variable -  Number of times stayed in hospital in last 12 months 
# (excluding delivery of healthy liveborn infant)
# Variable name: 'S13Q1', codes: 0-98 number of times, 99 - unknown
# As the codes are meaningful numbers, I first convert the variable to a numeric type
myData['S13Q1']=pandas.to_numeric(myData['S13Q1'])
# Counts of overnight hospital stays
print('Counts of hospital stays overnight (last 12 months): 0-98 number of times, 99 - unknown')
c4=myData['S13Q1'].value_counts(sort = True, dropna = False)
print(c4)
# associated frequencies
print('Associated percentages for hospital stays overnight (last 12 months): 0-98 number of times, 99 - unknown')
p4=myData['S13Q1'].value_counts(sort = True, dropna = False, normalize = True)
print(p4)

Counts of hospital stays overnight (last 12 months): 0-98 number of times, 99 - unknown
0     30846
1      2673
99      873
2       736
3       249
4        95
5        62
6        34
7        11
10       10
12        7
8         6
30        3
14        3
9         3
15        2
18        2
60        1
11        1
16        1
35        1
20        1
24        1
90        1
Name: S13Q1, dtype: int64
Associated percentages for hospital stays overnight (last 12 months): 0-98 number of times, 99 - unknown
0     0.865926
1     0.075038
99    0.024507
2     0.020661
3     0.006990
4     0.002667
5     0.001740
6     0.000954
7     0.000309
10    0.000281
12    0.000197
8     0.000168
30    0.000084
14    0.000084
9     0.000084
15    0.000056
18    0.000056
60    0.000028
11    0.000028
16    0.000028
35    0.000028
20    0.000028
24    0.000028
90    0.000028
Name: S13Q1, dtype: float64


**Summary for the fourth frequency check - Overnight hospital stays (in last 12 months)**: The last univariate analysis shows that most subjects did not stay overnight in a hospital within the last 12 months. More precisely, 30846 or ca. 86.6% of all U.S. born subjects in the data set had no hospital stay. The counts then fall quickly as the number of stays increases: 2673 subjects (7.5%) stayed once and 736 (2.1%) stayed twice. For any number of stays x of 3 or more, less than 1% of subjects stayed exactly x times. Each number of stays above 10 occurs for at most 7 subjects, and most of them occur only once.

Missing data: For 873 subjects, or almost 2.5% of all U.S. born subjects, the answer to this specific question is unknown (code 99).
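
Because 99 is a code for "unknown" and not a real number of stays, it should be set to missing before any numeric summary is computed. The sketch below uses a made-up sample, recodes 99 to NaN and then groups the long tail into "3 or more"; the bin edges and labels are my own choice, not part of the codebook.

```python
import numpy
import pandas

# Made-up sample, not NESARC data: number of hospital stays, 99 = unknown
stays = pandas.Series([0, 0, 1, 2, 3, 12, 99, 0, 5], name="S13Q1")

# Recode the "unknown" code to a proper missing value
stays_clean = stays.replace(99, numpy.nan)
print(stays_clean.isna().sum())   # number of unknown answers
print(stays_clean.median())       # median over the known answers only

# Group the number of stays: 0, 1, 2, and 3 to 98
bins = [-1, 0, 1, 2, 98]
labels = ["0", "1", "2", "3 or more"]
grouped = pandas.cut(stays_clean, bins=bins, labels=labels)
group_counts = grouped.value_counts(sort=False)
print(group_counts)
```

Missing values stay missing after `pandas.cut`, so they are reported separately and do not end up in any group.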

# Summary

By investigating only U.S. born subjects, I exclude approximately 17.3% of all data provided in NESARC and analyze data on 35622 subjects.
71.0% of those 35622 subjects were employed within the last 12 months. For circa 20% of all U.S. born persons in the data (irrespective of employment status), the occupation type is not applicable or missing. The largest named occupation category is "Professional Speciality" and the smallest is "Military". Regarding overnight stays in hospital, the large majority of subjects (30846 or 86.6%) did not stay in hospital in the last 12 months.
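
As a cross-check, the headline percentages can be recomputed from the counts printed in the outputs above. This is a minimal sketch; the variable names are my own, and the counts are copied from the outputs, not recomputed from the data file.

```python
# Counts copied from the outputs above
total_rows = 43093        # all NESARC observations
us_born = 35622           # S1Q1F == 1
employed = 25301          # S1Q8A == 1 among U.S. born subjects
blank_occupation = 7308   # blank S1Q9B among U.S. born subjects
no_hospital_stay = 30846  # S13Q1 == 0 among U.S. born subjects

# Percentages rounded to one decimal place
shares = {
    "excluded": round(100 * (total_rows - us_born) / total_rows, 1),
    "employed": round(100 * employed / us_born, 1),
    "blank occupation": round(100 * blank_occupation / us_born, 1),
    "no hospital stay": round(100 * no_hospital_stay / us_born, 1),
}
for name, value in shares.items():
    print(name, value)
```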