# ICAEW Data Analytics Certificate Programme Case Study (Analyst pathway)
## Data wrangling

Welcome to the Case Study for the ICAEW Data Analytics Certificate Programme (Analyst pathway). 

We have discussed the background and context and examined the data and techniques to be used in the case study already. Now we will use the skills we have learnt from this course to analyse the sales of the company we may be investing in.


We'll use the Journals and Sales datasets provided. The structure of this notebook is as follows:

- First, we will start off by loading and viewing the datasets.
- We will see that the datasets have a mixture of both numerical and non-numerical features.
- We will see that the datasets have various data quality issues, such as they contain a number of missing entries and duplicates.
- We will use techniques covered in the course to address these issues and prepare the data for analysis.
- Finally, we will append and join any relevant datasets together to wrangle our data into a usable format.

## Loading and viewing the datasets

First, loading and viewing the datasets. The sales dataset and the journals dataset provided are an Excel, and two csv files respectively. We examined how to load Excel and csv datasets within the course in Module 1 of Unit 2: Data Wrangling.

In [5]:
# Import packages required
import pandas as pd
# Load the sales dataset as sales
sales = pd.read_excel('SalesCS.xlsx')
# Examine the sales dataset to ensure it has been read in accurately and examine the data contained
sales.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Postal Code,City,...,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority
0,4042,MX-2015-AB1001539-42353,2019-12-15,2019-12-19,Standard Class,AB-1001539,Aaron Bergman,Consumer,,Apopa,...,FUR-CH-5379,Furniture,Chairs,"Novimex Executive Leather Armchair, Black",610.6,2,0.0,238.12,57.833,Medium
1,4041,MX-2015-AB1001539-42353,2019-12-15,2019-12-19,Standard Class,AB-1001539,Aaron Bergman,Consumer,,Apopa,...,OFF-SU-2966,Office Supplies,Supplies,"Acme Box Cutter, High Speed",151.2,6,0.0,75.6,10.786,Medium
2,24145,IN-2015-AB1001558-42256,2019-09-09,2019-09-09,Same Day,AB-1001558,Aaron Bergman,Consumer,,Hubli,...,OFF-BI-6383,Office Supplies,Binders,"Wilson Jones Binding Machine, Durable",50.46,1,0.0,22.68,10.54,Critical
3,24144,IN-2015-AB1001558-42256,2019-09-09,2019-09-09,Same Day,AB-1001558,Aaron Bergman,Consumer,,Hubli,...,OFF-BI-3737,Office Supplies,Binders,"Cardinal Index Tab, Clear",26.88,4,0.0,12.0,6.55,Critical
4,26085,ID-2015-AB1001559-42178,2019-06-23,2019-06-27,Second Class,AB-1001559,Aaron Bergman,Consumer,,Palembang,...,FUR-FU-3935,Furniture,Furnishings,"Deflect-O Door Stop, Erganomic",372.9132,12,0.27,101.8332,53.07,Medium


In [278]:
# Load the first journals dataset as journals1
journals1= pd.read_csv('Journals Part 1.csv')

# Examine it to ensure it has been read in accurately and examine the data contained
journals1.head()

Unnamed: 0,Account,AccountDesc,TransDesc,Debit,Credit,Period,JnlNo,JnlDesc,Amount,JnlPrep,JnlAuth,JnlDateTime
0,00-80-8033,Provision for Sales Schemes,ZZX,0.0,9668.59,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,-9668.59,HV09,AS13,01/01/2019 13:04
1,00-10-1002,Provisions - Trade Sales,XXX,9668.59,0.0,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,9668.59,HV09,AS13,01/01/2019 13:04
2,00-80-8033,Provision for Sales Schemes,ZZX,0.0,291191.3,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,-291191.3,HV09,AS13,01/01/2019 13:04
3,00-10-1001,Trade Sale Recycle Scheme,ZZZ,291191.3,0.0,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,291191.3,HV09,AS13,01/01/2019 13:04
4,00-20-2004,Provision for Obselete Inventory,923,0.0,12848.5,2019-1,10,Reversed By Jnl 2019-4 Journal No. 366,-12848.5,DF18,TC01,30/01/2019 09:56


In [280]:
# Load the second journals dataset as journals2
journals2 = pd.read_csv('Journals Part 2.csv')

# Examine it to ensure it has been read in accurately and examine the data contained
journals2.head()

Unnamed: 0,Account,AccountDesc,TransDesc,Debit,Credit,Period,JnlNo,JnlDesc,Amount,JnlPrep,JnlAuth,JnlDateTime
0,00-80-8043,Tax Control,XXX,19763.35,0.0,2019-7,253,,19763.35,DF18,AM04,01/07/2019 16:44
1,00-80-8044,Tax Input,XZZX,0.0,19763.35,2019-7,253,,-19763.35,DF18,AM04,01/07/2019 16:44
2,00-80-8043,Tax Control,XXX,0.0,22956.9,2019-7,253,,-22956.9,DF18,AM04,01/07/2019 16:44
3,00-80-8045,Tax Output,ZZXXZ,22956.9,0.0,2019-7,253,,22956.9,DF18,AM04,01/07/2019 16:44
4,00-80-8045,Tax Output,ZZXXZ,40240.3,0.0,2019-7,253,,40240.3,DF18,AM04,01/07/2019 16:44


## Examining and setting data types

As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. By examining the data types, we can see how Python has interpreted these features and we can change the types of any fields, as required.


In [15]:
# Examine the data types of the sales data
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51308 entries, 0 to 51307
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Row ID          51308 non-null  int64         
 1   Order ID        51308 non-null  object        
 2   Order Date      51308 non-null  datetime64[ns]
 3   Ship Date       51308 non-null  datetime64[ns]
 4   Ship Mode       51308 non-null  object        
 5   Customer ID     51308 non-null  object        
 6   Customer Name   51308 non-null  object        
 7   Segment         51286 non-null  object        
 8   Postal Code     9998 non-null   float64       
 9   City            51308 non-null  object        
 10  State           51308 non-null  object        
 11  Country         51308 non-null  object        
 12  Region          51308 non-null  object        
 13  Market          51248 non-null  object        
 14  Product ID      51308 non-null  object        
 15  Ca

In [282]:
# Examine the data types of the first journals data
journals1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Account      1800 non-null   object 
 1   AccountDesc  1800 non-null   object 
 2   TransDesc    1800 non-null   object 
 3   Debit        1800 non-null   float64
 4   Credit       1800 non-null   float64
 5   Period       1800 non-null   object 
 6   JnlNo        1800 non-null   int64  
 7   JnlDesc      316 non-null    object 
 8   Amount       1800 non-null   float64
 9   JnlPrep      1800 non-null   object 
 10  JnlAuth      1800 non-null   object 
 11  JnlDateTime  1783 non-null   object 
dtypes: float64(3), int64(1), object(8)
memory usage: 168.9+ KB


In [284]:
# Examine the data types of the second journals data
journals2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2201 entries, 0 to 2200
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Account      2201 non-null   object 
 1   AccountDesc  2201 non-null   object 
 2   TransDesc    2201 non-null   object 
 3   Debit        2201 non-null   float64
 4   Credit       2201 non-null   float64
 5   Period       2201 non-null   object 
 6   JnlNo        2201 non-null   int64  
 7   JnlDesc      400 non-null    object 
 8   Amount       2201 non-null   float64
 9   JnlPrep      2201 non-null   object 
 10  JnlAuth      2201 non-null   object 
 11  JnlDateTime  2201 non-null   object 
dtypes: float64(3), int64(1), object(8)
memory usage: 206.5+ KB


In [None]:
# Set any data types, if required


## Identifying any data quality issues

We've now examined the data fields and types contained in our data. Next, we should examine the quality of our data. Two of the biggest data quality issues that can affect and distort our analysis are duplicates and missing data. 

Now, examine the datasets provided to identify if there are any duplicate rows or entries in the dataset and if any fields have missing values. 

We have covered how to check for duplicates and for missing values within a dataset in Module 3 of Unit 2: Data Wrangling ‘Duplicates and missing datasets’. If you purchased the learning and certificate Analyst Pathway, we strongly recommend you revisit this content if you are struggling to complete these tasks.

In [288]:
# Examine the number of duplicates in the journal dataset
journals = pd.concat([journals1,journals2])
duplicates = journals.duplicated()
duplicated_values = journals[duplicates].sort_values(by = 'Amount')
print(duplicated_values['JnlNo'].count())

40


##### <b>Please take a note of the number of duplicate journal entries.</b>

<i>Select here to type your answer: 40

The following is assessed in the assessment.

We can examine the duplicates by using the duplicated() command. This produces a boolean indicating whether each row is a duplicate or not, with True signifying the sales order is a duplicate.

In [33]:
# Examine the number of duplicates in the sales dataset
duplicates = sales.duplicated()
duplicated_values = sales[duplicates].sort_values(by = 'Order ID')
print(duplicated_values['Order ID'].count())

18


##### <b>Please take a note of the number of duplicate sales orders.</b>

<i>Select here to type your answer: 18

The following is assessed in the assessment.

We can identify missing values by using the is.na() command, which returns a boolean for each data point on whether it is a missing value or not.

In [290]:
# Examine if there are any missing values in the first journal dataset
journals.isna().sum()

Account           0
AccountDesc       0
TransDesc         0
Debit             0
Credit            0
Period            0
JnlNo             0
JnlDesc        3285
Amount            0
JnlPrep           0
JnlAuth           0
JnlDateTime      17
dtype: int64

##### <b>Please take a note of the fields which have missing values in the journals dataset.</b>

<i>Select here to type your answer: JnlDesc, JnlDateTime

In [62]:
# Examine if there are any missing values in the sales dataset
sales.isna().sum()

Row ID                0
Order ID              0
Order Date            0
Ship Date             0
Ship Mode             0
Customer ID           0
Customer Name         0
Segment              22
Postal Code       41310
City                  0
State                 0
Country               0
Region                0
Market               60
Product ID            0
Category              0
Sub-Category          0
Product Name          0
Sales                 0
Quantity              0
Discount              0
Profit                0
Shipping Cost         0
Order Priority        0
dtype: int64

The following is assessed in the assessment.

We can identify missing values by using the is.na() command, which returns a boolean for each data point on whether it is a missing value or not.

##### <b>Please take a note of the number of missing values in the Segment column of the sales dataset.</b>

<i>Select here to type your answer: 22

## Handling and resolving data quality issues

We have identified that there are numerous missing values or duplicate entries in all of the datasets. First we will handle the missing values. 

### Missing values

There are several fields in the data that have missing values to a different extent. The best way to handle these fields depends on what is missing. 

First, let's consider the sales data. We have identified there are missing values for the customer's post code, the order's market and the customer segment. Due to the amount of missing values for Post Code and the fact it is unlikely we could infer them, it may be best to drop Post Code from our sales dataset. However, it may be possible to suggest a suitable value for customer segment and order's market.

While we examined missing values in Module 3 of Unit 2: Data Wrangling, to identify these values we will use the practical skills of subsetting and selecting data that we learnt in Module 2 of Unit 3: Analysing the Data. If you purchased the learning and certificate Analyst Pathway, we recommend you revisit this content  if you are struggling to complete these tasks.

In [64]:
# Drop Post Code from the sales data
sales = sales.drop('Postal Code', axis=1)

In [67]:
# Examine the rows which have missing values for Market in the sales data and examine the regions
sales['Market'].isna()

0        False
1        False
2        False
3        False
4        False
         ...  
51303    False
51304    False
51305    False
51306    False
51307    False
Name: Market, Length: 51308, dtype: bool

In [83]:
# Identify the missing value for Market in the sales data using the region field
sales[sales.isnull().any(axis=1)]

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,City,State,...,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority
190,4260,MX-2013-AS1004598-41501,2017-08-15,2017-08-20,Standard Class,AS-1004598,Aaron Smayling,Corporate,Panama City,Panama,...,OFF-SU-4117,Office Supplies,Supplies,"Elite Box Cutter, High Speed",13.956,1,0.4,-5.5840,1.252,Medium
1127,7313,US-2012-AB1025582-40928,2016-01-20,2016-01-25,Standard Class,AB-1025582,Alejandro Ballentine,Home Office,Mexico City,Distrito Federal,...,FUR-BO-3640,Furniture,Bookcases,"Bush Library with Doors, Mobile",195.648,1,0.2,-34.2520,17.083,Medium
1272,347,MX-2015-AG1030082-42367,2019-12-29,2020-01-01,First Class,AG-1030082,Aleksandra Gannaway,Corporate,Morelia,Michoacán,...,OFF-EN-5030,Office Supplies,Envelopes,"Kraft Interoffice Envelope, Security-Tint",328.800,10,0.0,147.8000,58.451,Medium
4664,912,MX-2012-BP1109582-40933,2016-01-25,2016-01-29,Standard Class,BP-1109582,Bart Pistole,Corporate,Coatzacoalcos,Veracruz,...,OFF-EN-4452,Office Supplies,Envelopes,"GlobeWeis Peel and Seal, Security-Tint",31.720,2,0.0,10.1200,4.451,High
5260,885,MX-2013-BF1121551-41503,2017-08-17,2017-08-19,Second Class,BF-1121551,Benjamin Farhat,Home Office,Huehuetenango,Huehuetenango,...,FUR-TA-3356,Furniture,Tables,"Barricks Wood Table, Adjustable Height",276.864,1,0.2,96.8840,31.932,Medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50935,38348,CA-2014-WB21850140-41985,2018-12-12,2018-12-12,Same Day,WB-218501404,William Brown,,Anaheim,California,...,OFF-AR-3518,Office Supplies,Art,Boston 16765 Mini Stand Up Battery Pencil Shar...,23.320,2,0.0,6.0632,6.790,Critical
50936,38357,CA-2014-WB21850140-41985,2018-12-12,2018-12-12,Same Day,WB-218501404,William Brown,,Anaheim,California,...,OFF-BI-3506,Office Supplies,Binders,"Black Avery Memo-Size 3-Ring Binder, 5 1/2"" x ...",5.872,2,0.2,2.1286,1.650,Critical
51296,26172,IN-2012-ZD2192511-41247,2016-12-04,2016-12-09,Standard Class,ZD-2192511,Zuschuss Donatelli,,Dhaka,Dhaka,...,OFF-AR-3483,Office Supplies,Art,"Binney & Smith Highlighters, Fluorescent",107.100,6,0.0,13.8600,14.800,High
51297,26173,IN-2012-ZD2192511-41247,2016-12-04,2016-12-09,Standard Class,ZD-2192511,Zuschuss Donatelli,,Dhaka,Dhaka,...,OFF-BI-6385,Office Supplies,Binders,"Wilson Jones Binding Machine, Recycled",98.640,2,0.0,31.5600,12.100,High


In [86]:
# Fill in the missing values for Market in the sales data and save the changes to this field
sales['Market'] = sales['Market'].fillna('Unkown')

In [90]:
# Examine the rows that have missing values for Customer Segment in the sales data and create a subset or list of the Customer Names
missing = sales[sales.isnull().any(axis=1)]
missingsubset = missing['Customer Name']
missingsubset

17056           Erica Smith
17057           Erica Smith
17058           Erica Smith
17059           Erica Smith
17060           Erica Smith
46902          Susan Pistek
46903          Susan Pistek
46904          Susan Pistek
46938          Susan Pistek
46939          Susan Pistek
46940          Susan Pistek
50929         William Brown
50930         William Brown
50931         William Brown
50932         William Brown
50933         William Brown
50934         William Brown
50935         William Brown
50936         William Brown
51296    Zuschuss Donatelli
51297    Zuschuss Donatelli
51305    Zuschuss Donatelli
Name: Customer Name, dtype: object

In [102]:
# Examine the sales orders for all customers in the list and identify the missing value for Segment
print(sales[sales['Customer Name'] == 'Erica Smith'])

       Row ID                  Order ID Order Date  Ship Date       Ship Mode  \
17037   10625   ES-2015-ES1402048-42369 2019-12-31 2020-01-04  Standard Class   
17038   12057   ES-2015-ES1402045-42343 2019-12-05 2019-12-10  Standard Class   
17039   12058   ES-2015-ES1402045-42343 2019-12-05 2019-12-10  Standard Class   
17040   12056   ES-2015-ES1402045-42343 2019-12-05 2019-12-10  Standard Class   
17041   45974    KE-2015-ES402069-42341 2019-12-03 2019-12-03        Same Day   
...       ...                       ...        ...        ...             ...   
17099   49918    BO-2012-ES402013-41147 2016-08-26 2016-09-01  Standard Class   
17100    1647   US-2012-ES1402018-41107 2016-07-17 2016-07-21  Standard Class   
17101   17296   ES-2012-ES1402045-41102 2016-07-12 2016-07-14     First Class   
17102   26010  IN-2012-ES14020130-41047 2016-05-18 2016-05-23    Second Class   
17103   26009  IN-2012-ES14020130-41047 2016-05-18 2016-05-23    Second Class   

       Customer ID Customer

In [114]:
# Resolve the missing values for Customer Segment in the sales data and save the changes to this field
sales.sort_values('Customer Name')
sales.fillna(method = 'ffill', inplace = True)

  sales.fillna(method = 'ffill', inplace = True)


We have identified that several journal entries have missing values for the time and date the journal was posted. We may be able to infer these values from the rest of the journals data.

The following is assessed in the assessment.

We can identify the missing journal date and time by examining the other journal entries from the same journal number. This is because these will have been posted together at the same time.

In [293]:
# Identify the rows of the missing values for Journal Date Time
journals.JnlDateTime[journals.JnlDateTime.isna()]

1027    NaN
1028    NaN
1029    NaN
1030    NaN
1031    NaN
1032    NaN
1033    NaN
1034    NaN
1035    NaN
1036    NaN
1037    NaN
1038    NaN
1039    NaN
1040    NaN
1041    NaN
1042    NaN
1043    NaN
Name: JnlDateTime, dtype: object

In [295]:
# Identify the missing values for Journal Date Time
journals.sort_values('JnlDateTime')
journals['JnlDateTime'][1020:1050]

1020    30/06/2019 09:26
1021    30/06/2019 09:26
1022    30/06/2019 09:26
1023    30/06/2019 09:26
1024    31/01/2019 12:27
1025    31/01/2019 12:27
1026    31/01/2019 12:27
1027                 NaN
1028                 NaN
1029                 NaN
1030                 NaN
1031                 NaN
1032                 NaN
1033                 NaN
1034                 NaN
1035                 NaN
1036                 NaN
1037                 NaN
1038                 NaN
1039                 NaN
1040                 NaN
1041                 NaN
1042                 NaN
1043                 NaN
1044    30/06/2019 12:47
1045    30/06/2019 12:47
1046    30/06/2019 15:15
1047    30/06/2019 15:15
1048    30/06/2019 15:15
1049    30/06/2019 15:15
Name: JnlDateTime, dtype: object

##### <b>Please take a note of the missing value in Journal Date Time.</b>

<i>Select here to type your answer: 31/01/2019 12:27

In [297]:
# Fill in the missing values for Journal Date Time 
journals['JnlDesc'].fillna('Blank')
journals.fillna(method = 'ffill', inplace = True)

  journals.fillna(method = 'ffill', inplace = True)


In [299]:
# Check the missing values have been filled in
journals['JnlDateTime'][1020:1050]

1020    30/06/2019 09:26
1021    30/06/2019 09:26
1022    30/06/2019 09:26
1023    30/06/2019 09:26
1024    31/01/2019 12:27
1025    31/01/2019 12:27
1026    31/01/2019 12:27
1027    31/01/2019 12:27
1028    31/01/2019 12:27
1029    31/01/2019 12:27
1030    31/01/2019 12:27
1031    31/01/2019 12:27
1032    31/01/2019 12:27
1033    31/01/2019 12:27
1034    31/01/2019 12:27
1035    31/01/2019 12:27
1036    31/01/2019 12:27
1037    31/01/2019 12:27
1038    31/01/2019 12:27
1039    31/01/2019 12:27
1040    31/01/2019 12:27
1041    31/01/2019 12:27
1042    31/01/2019 12:27
1043    31/01/2019 12:27
1044    30/06/2019 12:47
1045    30/06/2019 12:47
1046    30/06/2019 15:15
1047    30/06/2019 15:15
1048    30/06/2019 15:15
1049    30/06/2019 15:15
Name: JnlDateTime, dtype: object

### Duplicates

We have identified that there are duplicate entries in all of the datasets.

In the first journals dataset, there is a whole journal that has been duplicated and a few duplicated journal entries. This is also the case for the second journals dataset. Lastly, there are several sales orders that are duplicated in the sales data. To avoid these compromising our analysis, we will remove these duplicates.

We have examined how to drop duplicates in Module 3 of Unit 2: Data Wrangling. If you purchased the learning and certificate Analyst Pathway, we recommend you revisit this content if you are struggling to complete these tasks.

In [301]:
# Remove the duplicates in the journals data and save this dataset as journals
journals.drop_duplicates(inplace = True)

In [176]:
# Remove the duplicates in the sales data and save this dataset as sales
sales.drop_duplicates(inplace = True)

## Wrangling the data

Now we have resolved our data quality issues, we can wrangle our data into a usable format for our analysis. 

Until now, we have been using three datasets, two of which are journal datasets covering different periods. For our analysis, it is better to use a single journal dataset to allow us to examine and analyse the journals data as a whole.

We practised appending the journals data and performing joins in Module 2 of Unit 2: Data Wrangling. If you purchased the learning and certificate Analyst Pathway, we recommend you revisit this content if you are struggling to complete these tasks.


In addition to the journal datasets provided, we have obtained a dataset containing the employee names for the accounts team responsbile for preparing and authorising the journals. It will be beneficial to include this information in the journal dataset.

In [216]:
# Import the Accounts Team dataset as AccountsTeam
accountsteam = pd.read_csv('Accounts Team Staff.csv')
# Examine the data
accountsteam.head()

Unnamed: 0,EmployeeRef,Employee Name
0,AC04,Lorelei Ory
1,AH12,Latasha Terpstra
2,JC39,Merrill Benzel
3,ID03,Evalyn Reddout
4,HV09,Lakeisha Testerman


In [303]:
# Join the Accounts Team data to the journals dataset

# First, join on the Journal Preparer
journals = journals.merge(accountsteam, how = 'left', left_on = 'JnlPrep', right_on = 'EmployeeRef')

In [305]:
# Rename the new column as JnlPreparerName and drop redundant columns EmployeeRef and Employee Name 
journals['JnlPreparerName'] = journals['Employee Name']
journals = journals.drop(columns = ['EmployeeRef','Employee Name'])

In [307]:
# Next, join on the Journal Authoriser field
journals = journals.merge(accountsteam, how = 'left', left_on='JnlAuth',right_on = 'EmployeeRef')

In [309]:
# Again rename new column as JnlAuthoriserName and drop redundant columns EmployeeRef and Employee Name 
journals['JnlAuthoriserName'] = journals['Employee Name']
journals = journals.drop(columns = ['EmployeeRef', 'Employee Name'])

In [311]:
# Examine your final data
journals.head()

Unnamed: 0,Account,AccountDesc,TransDesc,Debit,Credit,Period,JnlNo,JnlDesc,Amount,JnlPrep,JnlAuth,JnlDateTime,JnlPreparerName,JnlAuthoriserName
0,00-80-8033,Provision for Sales Schemes,ZZX,0.0,9668.59,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,-9668.59,HV09,AS13,01/01/2019 13:04,Lakeisha Testerman,Jonelle Moseley
1,00-10-1002,Provisions - Trade Sales,XXX,9668.59,0.0,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,9668.59,HV09,AS13,01/01/2019 13:04,Lakeisha Testerman,Jonelle Moseley
2,00-80-8033,Provision for Sales Schemes,ZZX,0.0,291191.3,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,-291191.3,HV09,AS13,01/01/2019 13:04,Lakeisha Testerman,Jonelle Moseley
3,00-10-1001,Trade Sale Recycle Scheme,ZZZ,291191.3,0.0,2019-1,1,Reversed By Jnl 2019-9 Journal No. 277,291191.3,HV09,AS13,01/01/2019 13:04,Lakeisha Testerman,Jonelle Moseley
4,00-20-2004,Provision for Obselete Inventory,923,0.0,12848.5,2019-1,10,Reversed By Jnl 2019-4 Journal No. 366,-12848.5,DF18,TC01,30/01/2019 09:56,Jon Mckinley,Johnny Hevey


The following is assessed in the assessment.

When dropping duplicates, we need to save the changes to our datasets in order to obtain the correct number of rows.
Similarly, we need to save the changes when appending the journal datasets together, and when we are merging datasets to add the employee names and dropping irrelevant columns.

In [313]:
# Examine the size of the final journal dataset
journals.shape

(3985, 14)

##### <b>Please take a note of the size of the final wrangled journals dataset.</b>

<i>Select here to type your answer: 3984,14

##### Assessment guidance

You are ready to take the assessment.

Remember, you should have fully completed your task and recorded your answers in Jupyter Notebook before moving on to the assessment. You can keep your Jupyter Notebook open in a separate browser window to refer to as you take the assessment.  

You will receive a score following completion of the assessment. If you have scored below the target mark of 60% for the section, you are recommended to refresh your knowledge in the course content (if purchased) and rework your Jupyter Notebook before re-attempting the assessment. You have a maximum of three assessment attempts.

You should aim to achieve a target score of 60% in each section of the case study. To pass the case study and be awarded the ICAEW Certificate, you are required to achieve a pass mark of 60% overall, averaged over all five sections, so do not be disheartened if you score below 60% in any one section, as a higher score in one or more of the other sections will contribute to the overall pass mark of 60%.

IMPORTANT: When submitting to the assessment portal, please do not navigate away from it until you have submitted all of your answers for that task. In between any of your 3 assessment attempts (but not during an attempt) you may navigate back to the course content, if you have purchased it, to refresh your knowledge and revisit your Jupyter Notebook to rework your analysis.