## Data Types and  Structure- Basic manipulation


This document contains the codes, explanation and implementation of the **2 tasks** where I manipulate data presented in different types and structure. 



##### *To begin, I first load the necessary packages and libraries needed for the 2 tasks*.


In [2]:

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype 
from pandas.api.types import union_categoricals
import datetime as dt 
from dateutil import parser, tz, relativedelta
import re




### Task 1

##### QUESTION

What is the most common cause of injury by a grouping categorical variable of your choosing?

To answer this question, below is a list of steps that I would use to analyse and answer the above question.

#### Steps in task 1

1. Reading the 2 dataset from my home directory to Pandas dataframe.

2. Joining the 2 dataset to a single dataframe  

3. Selecting the variables of interest for analysis into a another dataframe.

4. Checking the data type of the variables of interest and ensuring they are of the right type.

5. Checking the details of the new dataframe, including checks for missing data and data format, long or wide.

6. Transforming the **Answer** from long to wide if necessary.




#### Step 1

In this step, I read the 2 dataset from my working directory to jupyter, then I pass the dataset to a pandas data frame as shown in the code below 

In [3]:
# reading the data from the website and passing them through pandas dataframe
# and respectively storing the dataset as as scot_injuries and health_boards

stroke_injuries = pd.read_csv("https://www.opendata.nhs.scot/dataset/f5dcf382-e6ca-49f6-b807-4f9cc29555bc/resource/47656572-e196-40c8-83e8-08b0b223b2e6/download/stroke_activitybyhbr.csv")

health_boards = pd.read_csv("https://www.opendata.nhs.scot/dataset/9f942fdb-e59e-44f5-b534-d6e17229cc7b/resource/652ff726-e676-4a20-abda-435b98dd7bdc/download/hb14_hb19.csv")



In a first step, I explore the data by using to the first few rows and the shape of both dataset. More exploration on the details and types of the variables of interest will be done in **step 4**. For now we simply look at the first few rows, columns and shape of both dataset. 


In [4]:
# Simple glance into what the rows and columns of the dataset look like

scot_injuries.head(5)

Unnamed: 0,FinancialYear,HBR,HBRQF,CA,CAQF,AgeGroup,AgeGroupQF,Sex,SexQF,InjuryLocation,InjuryLocationQF,InjuryType,InjuryTypeQF,NumberOfAdmissions
0,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,All Diagnoses,d,53815
1,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,RTA,,3008
2,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,Poisoning,,2242
3,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,Falls,,33385
4,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,"Struck by, against",,2423


In [5]:
# Simple glance of the shape of the dataset. That is, the number of rows and columns in the dataset.

scot_injuries.shape

(391113, 14)

In [6]:
#Simple glance into what the rows and columns of the dataset look like

health_boards.head(5)

Unnamed: 0,_id,HB,HBName,HBDateEnacted,HBDateArchived,Country,CountryName
0,1,S08000015,NHS Ayrshire and Arran,20140401,,S92000003,Scotland
1,2,S08000016,NHS Borders,20140401,,S92000003,Scotland
2,3,S08000017,NHS Dumfries and Galloway,20140401,,S92000003,Scotland
3,4,S08000018,NHS Fife,20140401,20180201.0,S92000003,Scotland
4,5,S08000019,NHS Forth Valley,20140401,,S92000003,Scotland


In [7]:
# Simple glance of the shape of the dataset. That is, the number of rows and columns in the dataset.

health_boards.shape

(18, 7)

For the dataset `health_boards` we see a total of 18 rows and 7 columns. Clearly, this second data frame has fewer data than the first data frame `(scot_injuries)`. There are several implication for the analysis. One implication is that this difference will determine the shape of our data frame after the join operations. Basically, the shape of the data frame will be determined by how I choose to join the 2 data frame.


#### Step 2

In this step, I would now join the 2 dataset into a single data frame. First, I perform an inner join operation. Also, since the **linkage key** do not have the same name, I rename the HBR column in `scot_injuries` to HB to ensure it has the same name as HB column the in `health_boards`, then I mearge the 2 data frame, to form a new data frame: `Data_inner`, that contain all the matching rows from both data frames. 

In [8]:
# Performing and inner join of both data frame, using the HBR and the HB columns to match the dataset.

Data_inner = pd.merge(scot_injuries.rename(columns={'HBR': 'HB'}), health_boards, on='HB', how='inner')

Data_inner

Unnamed: 0,FinancialYear,HB,HBRQF,CA,CAQF,AgeGroup,AgeGroupQF,Sex,SexQF,InjuryLocation,InjuryLocationQF,InjuryType,InjuryTypeQF,NumberOfAdmissions,_id,HBName,HBDateEnacted,HBDateArchived,Country,CountryName
0,2012/13,S08000015,,S12000008,,All,d,All,d,All,d,All Diagnoses,d,1503,1,NHS Ayrshire and Arran,20140401,,S92000003,Scotland
1,2012/13,S08000015,,S12000008,,All,d,All,d,All,d,RTA,,63,1,NHS Ayrshire and Arran,20140401,,S92000003,Scotland
2,2012/13,S08000015,,S12000008,,All,d,All,d,All,d,Poisoning,,67,1,NHS Ayrshire and Arran,20140401,,S92000003,Scotland
3,2012/13,S08000015,,S12000008,,All,d,All,d,All,d,Falls,,938,1,NHS Ayrshire and Arran,20140401,,S92000003,Scotland
4,2012/13,S08000015,,S12000008,,All,d,All,d,All,d,"Struck by, against",,101,1,NHS Ayrshire and Arran,20140401,,S92000003,Scotland
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378958,2021/22,S08000032,,S12000029,,75plus years,,Female,,Undisclosed,,"Struck by, against",,2,18,NHS Lanarkshire,20190401,,S92000003,Scotland
378959,2021/22,S08000032,,S12000029,,75plus years,,Female,,Undisclosed,,Crushing,,0,18,NHS Lanarkshire,20190401,,S92000003,Scotland
378960,2021/22,S08000032,,S12000029,,75plus years,,Female,,Undisclosed,,Scalds,,0,18,NHS Lanarkshire,20190401,,S92000003,Scotland
378961,2021/22,S08000032,,S12000029,,75plus years,,Female,,Undisclosed,,Accidental Exposure,,13,18,NHS Lanarkshire,20190401,,S92000003,Scotland


In contrast to inner join, In the code below, I join the 2 dataset into a single data frame using an outer join operations. Specifically, I use a full join operation. This means I joing the 2 dataset to a single dataframe with any consideration to matching rows. One clear advantage of the full join over the inner join is that since we make no consideration to matching row, the full join ensures there is no data loss. 

**Note** Regardless of this merit of the full join over inner join, I choose to proceed with my analysis using the data from the inner join. My main reason for this is just for personal exploration since the full join was used in our previous `tutorial 5`. Doing something different this time is a great way for me to learn and explore.

In [9]:
# Outer join of both data frames into a single data frame.

Data_outer = pd.merge(scot_injuries.rename(columns={'HBR': 'HB'}), health_boards, on='HB', how='outer')

Data_outer

Unnamed: 0,FinancialYear,HB,HBRQF,CA,CAQF,AgeGroup,AgeGroupQF,Sex,SexQF,InjuryLocation,InjuryLocationQF,InjuryType,InjuryTypeQF,NumberOfAdmissions,_id,HBName,HBDateEnacted,HBDateArchived,Country,CountryName
0,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,All Diagnoses,d,53815.0,,,,,,
1,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,RTA,,3008.0,,,,,,
2,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,Poisoning,,2242.0,,,,,,
3,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,Falls,,33385.0,,,,,,
4,2012/13,S92000003,d,S92000003,d,All,d,All,d,All,d,"Struck by, against",,2423.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391112,2021/22,S08000032,,S12000029,,75plus years,,Female,,Undisclosed,,Other,,13.0,18.0,NHS Lanarkshire,20190401.0,,S92000003,Scotland
391113,,S08000018,,,,,,,,,,,,,4.0,NHS Fife,20140401.0,20180201.0,S92000003,Scotland
391114,,S08000021,,,,,,,,,,,,,7.0,NHS Greater Glasgow and Clyde,20140401.0,20190331.0,S92000003,Scotland
391115,,S08000023,,,,,,,,,,,,,9.0,NHS Lanarkshire,20140401.0,20190331.0,S92000003,Scotland


#### Step 3

In this step, I select my variables of interest. Specifically, I select `Age Group, Sex and Injury type`. The main reason for my selection is curiosity and to explore which injury is critical across different age profile and sex. **Recall** in step 2, I explain my reason for choosing inner join

In [10]:
# This operation selects the variable of interest, Age, sex and injury type into a new data frame called Data_DF

Data_DF = Data_inner.loc[:,["Sex", "AgeGroup", "InjuryType"]]

Data_DF

Unnamed: 0,Sex,AgeGroup,InjuryType
0,All,All,All Diagnoses
1,All,All,RTA
2,All,All,Poisoning
3,All,All,Falls
4,All,All,"Struck by, against"
...,...,...,...
378958,Female,75plus years,"Struck by, against"
378959,Female,75plus years,Crushing
378960,Female,75plus years,Scalds
378961,Female,75plus years,Accidental Exposure


#### Step 4
In this step, first, I check the data type of the variables of interest. As seen below, the data type for the 3 variable is object. Obviously this is not the right format since all 3 variables are categorical data type. To resolve this, in the next operation, I convert the variables to categorical data types.


In [11]:
#Checking the data types of the variables of interest

print(Data_DF.dtypes)

Sex           object
AgeGroup      object
InjuryType    object
dtype: object


In [12]:
# Coverting the variables of interest to a categorical data type
cat = ["Sex", "AgeGroup", "InjuryType"]
Data_DF[cat] = Data_DF[cat].astype("category")

# Checking the types and printing the shape of the variables.
print(Data_DF.dtypes)
print(Data_DF.shape)

Sex           category
AgeGroup      category
InjuryType    category
dtype: object
(378963, 3)


#### Step 5

Just like the beginning in step 1, in this step, I again looks at the details of the variable of interest, specifically, I use `.unique` and `isna()` to check the unique categories in the 3 categorical variables in the data. Notice that there are also that for all 3 variables, there is **no missing values** and all 3 variables are in the now in the right data types following the conversion to `categories` data type in the previous step. 

In [13]:
# check Sex 

print(Data_DF.Sex.unique()) # no aggregate responses 

print(Data_DF.Sex.isna().sum()) # no missing data 

['All', 'Male', 'Female']
Categories (3, object): ['All', 'Female', 'Male']
0


In [14]:
# check AgeGroup

print(Data_DF.AgeGroup.unique()) # no aggregate responses 

print(Data_DF.AgeGroup.isna().sum()) # no missing data 

['All', '0-4 years', '5-9 years', '10-14 years', '15-24 years', '25-44 years', '45-64 years', '65-74 years', '75plus years']
Categories (9, object): ['0-4 years', '10-14 years', '15-24 years', '25-44 years', ..., '5-9 years', '65-74 years', '75plus years', 'All']
0


In [15]:
# check InjuryTyoe

print(Data_DF.InjuryType.unique()) # no aggregate responses 

print(Data_DF.InjuryType.isna().sum()) # no missing data 

['All Diagnoses', 'RTA', 'Poisoning', 'Falls', 'Struck by, against', 'Crushing', 'Scalds', 'Accidental Exposure', 'Other']
Categories (9, object): ['Accidental Exposure', 'All Diagnoses', 'Crushing', 'Falls', ..., 'Poisoning', 'RTA', 'Scalds', 'Struck by, against']
0


In [16]:
## check data descriptions for the data frane

print(Data_DF.info())

print(Data_DF.describe()) # include all columns not just numeric data 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 378963 entries, 0 to 378962
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   Sex         378963 non-null  category
 1   AgeGroup    378963 non-null  category
 2   InjuryType  378963 non-null  category
dtypes: category(3)
memory usage: 4.0 MB
None
           Sex AgeGroup           InjuryType
count   378963   378963               378963
unique       3        9                    9
top        All      All  Accidental Exposure
freq    128286    43362                42107


By examination of the output above, we can see that there is no unexpected issues with the data. Specifically, there are no missing values. Also, from previous operations, we have converted all 3 variables to the proper data type-`categorical`.

Also, from the code below, data is in a **long format** which is suitable for the analysis

In [17]:
# checking to see what the data frame for the analysis looks like.

Data_DF

Unnamed: 0,Sex,AgeGroup,InjuryType
0,All,All,All Diagnoses
1,All,All,RTA
2,All,All,Poisoning
3,All,All,Falls
4,All,All,"Struck by, against"
...,...,...,...
378958,Female,75plus years,"Struck by, against"
378959,Female,75plus years,Crushing
378960,Female,75plus years,Scalds
378961,Female,75plus years,Accidental Exposure


#### Step 6
**Answering the question**: What is the most common cause of injury by a grouping of the categorical variables choosen?

Before we proceed to answer, first, lets I go a step further to order the categorical variable `AgeGroup`, this will ensure the column is sorted and we can make comparison of the values from the dataset.

Aslo, I filter out aggregate level responses in our variables, since these are not needed for the analysis.See the codes below.



In [18]:
# Ordering the categorical variable AgeGroup to ensure the column is sorted.

Data_DF['AgeGroup'] = Data_DF['AgeGroup'].cat.as_ordered()

In [19]:
# This code filter out the aggregate level responses since these are not needed in the analysis.

Data_DF = Data_DF.loc[~Data_DF["Sex"].isin(["All"])&  
                    (~Data_DF["AgeGroup"].isin(["All"])) & 
                    (~Data_DF["InjuryType"].isin(["All Diagnoses"]))]

print(Data_DF)


           Sex      AgeGroup           InjuryType
181       Male     0-4 years                  RTA
182       Male     0-4 years            Poisoning
183       Male     0-4 years                Falls
184       Male     0-4 years   Struck by, against
185       Male     0-4 years             Crushing
...        ...           ...                  ...
378958  Female  75plus years   Struck by, against
378959  Female  75plus years             Crushing
378960  Female  75plus years               Scalds
378961  Female  75plus years  Accidental Exposure
378962  Female  75plus years                Other

[197152 rows x 3 columns]


Finally, we having gotten the data to the right data structure and type for the variables of interest. There are no missing values as previous checked.

**I now answer the question for `Task 1` using the code below**. 

The code groups the data frame `Data_DF` by `AgeGroup` and `Sex columns`, so that we have unique group of sex and age for each row. Then for each unique row, we count the number of observation for each injury type. The output is seen below.  


In [20]:
# Grouping the data frame by AgeGroup and Sex, then counting the number of observation for each injury type in each unique row.

Data_group = Data_DF.groupby(['AgeGroup', 'Sex'])['InjuryType'].value_counts().to_frame()
Data_group

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,InjuryType
AgeGroup,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1
0-4 years,Female,Accidental Exposure,1382
0-4 years,Female,Crushing,1382
0-4 years,Female,Falls,1382
0-4 years,Female,Other,1382
0-4 years,Female,Poisoning,1382
...,...,...,...
75plus years,Male,Poisoning,1584
75plus years,Male,RTA,1584
75plus years,Male,Scalds,1584
75plus years,Male,"Struck by, against",1584


#### Step 7
While the table above shows the answer to the question, It is a bit unclear to read and glance through. I resolve this by reshaping the data frame. Specifically, I make the **data frame wider** so that we can better answer the question. Since we are particularly interested in Injury type, I widen the data frame by collapsing the it by injury type so that the each injury type is presented in a separate column. This operation is implemented in the code below.

In [21]:
# Widening the data frame by InjuryType, so that each injury is presented in a separate column

group_unstack = Data_DF.groupby(['AgeGroup', 'Sex'])['InjuryType'].value_counts().unstack()
group_unstack

Unnamed: 0_level_0,Unnamed: 1_level_0,Accidental Exposure,Crushing,Falls,Other,Poisoning,RTA,Scalds,"Struck by, against",All Diagnoses
AgeGroup,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0-4 years,Female,1382,1382,1382,1382,1382,1382,1382,1382,0
0-4 years,Male,1451,1451,1451,1451,1451,1451,1451,1451,0
10-14 years,Female,1456,1456,1456,1456,1456,1456,1456,1456,0
10-14 years,Male,1516,1516,1516,1516,1516,1516,1516,1516,0
15-24 years,Female,1528,1528,1528,1528,1528,1528,1528,1528,0
15-24 years,Male,1582,1582,1582,1582,1582,1582,1582,1582,0
25-44 years,Female,1574,1574,1574,1574,1574,1574,1574,1574,0
25-44 years,Male,1597,1597,1597,1597,1597,1597,1597,1597,0
45-64 years,Female,1595,1595,1595,1595,1595,1595,1595,1595,0
45-64 years,Male,1600,1600,1600,1600,1600,1600,1600,1600,0


In [22]:
# We can further widen the data frame by Sex and InjuryType for a more clearer view.

group_unstack2 = Data_DF.groupby(['AgeGroup', 'Sex'])['InjuryType'].value_counts().unstack(level = [-1, 1])
group_unstack2

Unnamed: 0_level_0,Accidental Exposure,Crushing,Falls,Other,Poisoning,RTA,Scalds,"Struck by, against",All Diagnoses,Accidental Exposure,Crushing,Falls,Other,Poisoning,RTA,Scalds,"Struck by, against",All Diagnoses
Sex,Female,Female,Female,Female,Female,Female,Female,Female,Female,Male,Male,Male,Male,Male,Male,Male,Male,Male
AgeGroup,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
0-4 years,1382,1382,1382,1382,1382,1382,1382,1382,0,1451,1451,1451,1451,1451,1451,1451,1451,0
10-14 years,1456,1456,1456,1456,1456,1456,1456,1456,0,1516,1516,1516,1516,1516,1516,1516,1516,0
15-24 years,1528,1528,1528,1528,1528,1528,1528,1528,0,1582,1582,1582,1582,1582,1582,1582,1582,0
25-44 years,1574,1574,1574,1574,1574,1574,1574,1574,0,1597,1597,1597,1597,1597,1597,1597,1597,0
45-64 years,1595,1595,1595,1595,1595,1595,1595,1595,0,1600,1600,1600,1600,1600,1600,1600,1600,0
5-9 years,1495,1495,1495,1495,1495,1495,1495,1495,0,1544,1544,1544,1544,1544,1544,1544,1544,0
65-74 years,1576,1576,1576,1576,1576,1576,1576,1576,0,1578,1578,1578,1578,1578,1578,1578,1578,0
75plus years,1586,1586,1586,1586,1586,1586,1586,1586,0,1584,1584,1584,1584,1584,1584,1584,1584,0


#### Task 1 conclusion

**Answer** As can be observed from the output above, it clearly all injury type represented by the columns, have the same number of counts or occurrence.However, we can still pick a conclusion from the analysis. Looking at the output, **Male and Female within the Ages of 45-64 years have the highest number of count for all types of injury- 1600 for Male and 1595 for Female see the output of line 21)**. One conclusion we can deduce is that Male and Female within that group have the highest plausible risk of injury for the different types of injury from the data. 

##                                                   Task 2

**Question**

> Explore the following data object x and present it in a more clear and better structured format, ensuring that the data structure and data types are appropriate.

To answer this question, first, lets load the data object x into the notebook. I simply copy and paste the data object x from the pdf to the jupyter notebook. I also check the data type and structure of x 


In [23]:
# Copying and pasting the object x to my jupyter

x = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
1, 0, 0, 0, 1, 0, 0, "red", "yellow", "yellow", "blue",
"yellow", "blue", "pink", "orange","green", "purple","orange",
"yellow", "blue", "red", "blue", "orange","orange", "orange",
"orange", "pink", "orange", "red", "green","blue", "orange",
2019, 1974, 1996, 2007, 2018, 1997, 2008, 1984, 2002, 2023,
1998, 1998, 1975, 2009, 1988, 2003, 1972, 1997, 2017, 1988,
1999, 1987, 2010, 1976, 2011, 80.587, 112.955, -27.760, 78.151,
-73.151, 57.707, 73.230, 91.306, 55.270, 67.374, -2.671, 50.970,
66.582, -14.618, 63.790, 47.689, 76.683, 96.753, 9.792, -45.848,
32.098, 86.168, 77.696, -1.677, 47.680]

# Checking the data type of the object x

type(x)

list

In [24]:
# Checking the length of x
len(x)

100

## Looking at the data, I can observe several data type

* `String` such as colours: blue, yellow, etc.
* `Numeric` there are `integers` like 0 and 1, and years like 2008, 2009. There are also `flaot` like 80.587

* Even though 0 and 1 are integers, we can also treat them as `boolean` data type such that `TRUE=1` and `FALSE = 0`. To better structure and make sense out of the dataset in x, I choose to represent ` 1 and 0` using the `boolean form`, just to create a better structure of the data and to distinguish them from other `non-zero` values in the list,x.

###  Steps in task 2

* First, **remember that x is a list data structure**, so even as I extract items from x, I will attempt to store them in the appropriate data type. To present the dataset in a more structured form, I will attempt to perform the following steps:

1. Extract from list x, the 1 and 0 and storing them in their `boolean` form in a separate list.
2. Extract from list x, the colours and storing them in their ` string data type` form in a separate list.
3. Extract from list x, the years, and storing them as a `integer data type` in a separate list.
4. Extract from list x, the float (decimal values), and storing them as `float data type` in a separate list.
5. Converting the resulting 4 list from above into a `Dictionary` data structure
6. Converting the `Dictionary` into a `data frame` with 4 columns each containing the different data type in a more structured form than originally presented.

#### Step 1
As earlier discussed, treating 1 and 0 as `boolean data type`, we extract them from the list x into a new list named `Boolean_type`. This operation is implemented in the code below, using a list comprehension operation.

In [25]:
# Extracting the boolean data type in x using list comprehension

Boolean_type = [bool(x) for x in x if x in [0, 1]]

print(Boolean_type)

len(Boolean_type) # printing the length of the list

[True, False, True, False, False, False, False, False, True, False, True, True, True, True, True, False, True, True, True, False, False, False, True, False, False]


25

#### Step 2
Extract the `string data type` from the list x into a new list named `String_type`. This operation is implemented in the code below, using a list comprehension operation.

In [27]:
# Extracting the string data type in x using list comprehension

String_type = [x for x in x if type(x) == str]
print(String_type)
len(String_type) # printing the length of the list


['red', 'yellow', 'yellow', 'blue', 'yellow', 'blue', 'pink', 'orange', 'green', 'purple', 'orange', 'yellow', 'blue', 'red', 'blue', 'orange', 'orange', 'orange', 'orange', 'pink', 'orange', 'red', 'green', 'blue', 'orange']


25

#### Step 3
Here we extract the `float data type` from the list x into a new list named `Float_type`. This operation is implemented in the code below, using a list comprehension operation.

In [26]:
# Extracting the float data type in x using list comprehension

Float_type = [x for x in x if type(x) == float]

print(Float_type)
len(Float_type) # printing the length of the list

[80.587, 112.955, -27.76, 78.151, -73.151, 57.707, 73.23, 91.306, 55.27, 67.374, -2.671, 50.97, 66.582, -14.618, 63.79, 47.689, 76.683, 96.753, 9.792, -45.848, 32.098, 86.168, 77.696, -1.677, 47.68]


25

#### Step 4
In this step we extract the `Integer data type`from the list x into a new list named `Integer_type`. This operation is implemented in the code below, using a list comprehension operation.

In [28]:
# Extracting the float data type in x using list comprehension
Integer_type = [x for x in x if isinstance(x, int) and x not in [0, 1]]
print(Integer_type)

len(Integer_type) # printing the length of the list

[2019, 1974, 1996, 2007, 2018, 1997, 2008, 1984, 2002, 2023, 1998, 1998, 1975, 2009, 1988, 2003, 1972, 1997, 2017, 1988, 1999, 1987, 2010, 1976, 2011]


25

#### Step 5

* In the last steps, I bring everything together. First, I convert all the 4 list into a `dictionary data structure` called `My_dict`

In [29]:
# joining all 4 list into a dictionary data structure called My_dict

My_dict = {"Float_data": Float_type, "String_data": String_type, "Boolean_data": Boolean_type, "Integer_data": Integer_type}

print(My_dict)


{'Float_data': [80.587, 112.955, -27.76, 78.151, -73.151, 57.707, 73.23, 91.306, 55.27, 67.374, -2.671, 50.97, 66.582, -14.618, 63.79, 47.689, 76.683, 96.753, 9.792, -45.848, 32.098, 86.168, 77.696, -1.677, 47.68], 'String_data': ['red', 'yellow', 'yellow', 'blue', 'yellow', 'blue', 'pink', 'orange', 'green', 'purple', 'orange', 'yellow', 'blue', 'red', 'blue', 'orange', 'orange', 'orange', 'orange', 'pink', 'orange', 'red', 'green', 'blue', 'orange'], 'Boolean_data': [True, False, True, False, False, False, False, False, True, False, True, True, True, True, True, False, True, True, True, False, False, False, True, False, False], 'Integer_data': [2019, 1974, 1996, 2007, 2018, 1997, 2008, 1984, 2002, 2023, 1998, 1998, 1975, 2009, 1988, 2003, 1972, 1997, 2017, 1988, 1999, 1987, 2010, 1976, 2011]}


#### Step 6

In this final step, convert `My_dict` from a `dictionary` data structure to a pandas `data frame` and named this `My_dataframe` 

In [30]:
# converting the dictionary My_dict to a pandas dataframe named My_dataframe
My_dataframe = pd.DataFrame.from_dict(My_dict)

My_dataframe

Unnamed: 0,Float_data,String_data,Boolean_data,Integer_data
0,80.587,red,True,2019
1,112.955,yellow,False,1974
2,-27.76,yellow,True,1996
3,78.151,blue,False,2007
4,-73.151,yellow,False,2018
5,57.707,blue,False,1997
6,73.23,pink,False,2008
7,91.306,orange,False,1984
8,55.27,green,True,2002
9,67.374,purple,False,2023


In [31]:
print(My_dataframe.shape) #checking the shape of the dataframe

(25, 4)


#### Task 2 conclusion

**Answer** As can be observed above, we now have a clearer and more structured presentation of the dataset from the object x, now transformed to a pandas data frame called `My_dataframe`. In summary, we have 25 rows and 4 columns with each column representing a distinct data type: **float, string, boolean, and integer.**


###### Bonus
As a bonus addition, we can further make String_data a category data type since it contains several colours. This is shown in the code below.

In [32]:
My_dataframe["String_data"] = My_dataframe["String_data"].astype("category")
My_dataframe.String_data.dtype

CategoricalDtype(categories=['blue', 'green', 'orange', 'pink', 'purple', 'red',
                  'yellow'],
, ordered=False)

# The End

