### Prepping Data Challenge: Working with Strings (week 9)
We have been given a set of messy strings containing information that we need to connect to other datasets to find out how much revenue we have generated by selling different products. Each string holds the quantity of items sold, the product ID code, the buyer's phone number, and part of the area code, which tells us where the customer is purchasing from. Some small calculations are also needed to join certain datasets together.
 

#### Requirement:
 1. Input the Customer Information file, split the values and reshape the data so there is a separate ID on each row. 
 2. Each ID field contains the following information we need to extract: 
    - The first 6 digits present in each ID are the customer's phone number
    - The first 2 digits after the ‘,’ are the last 2 digits of the area code
    - The letter following this is the first letter of the name of the area that they are calling from
    - The digits after this letter represent the quantity of products ordered
    - The letters after the ‘-’ are the product ID codes
 3. Rename these fields appropriately, and remove any unwanted columns – leaving only these 5 columns in the workflow. 
 4. Input the Area Code Lookup Table – find a way to join it to the Customer information file 
 5. We don’t actually sell products in Clevedon, Fakenham, or Stornoway. Exclude these from our dataset 
 6. In some cases, the ID field does not contain enough information to determine where the customer is from. Exclude any phone numbers where the join has produced duplicated records.
 7. Remove any unwanted fields created from the join. 
 8. Join this dataset to our product lookup table. 
 9. For each area, and product, find the total sales values, rounded to zero decimal places 
 10. Rank how well each product sold in each area. 
 11. For each area, work out the percent of total that each different product contributes to the overall revenue of that Area, rounded to 2 decimal places. 
 12. Output the data 

### 1. Input the Customer Information file, split the values and reshape the data so there is a separate ID on each row.

In [1]:
#import libraries
import pandas as pd

In [2]:
area_code = pd.read_excel('WK9-Area Code Lookup.xlsx')
customer_info = pd.read_excel('WK9-Customer Information.xlsx')
product = pd.read_excel('WK9-Product Lookup.xlsx')

In [3]:
area_code.head()

Unnamed: 0,Code,Area
0,114,Sheffield
1,115,Nottingham
2,116,Leicester
3,117,Bristol
4,118,Reading


In [4]:
customer_info.head()

Unnamed: 0,IDs
0,"Ju856452,13S24-SPL wd234175,29M77-SPL KZ621372..."
1,"jM391563,00C69-SPL Uc296328,17S73-SBP EL580409..."
2,"rV469041,02L68-HS Rn519453,20L22-SPL pd615208,..."
3,"GQ505960,03W64-SBP JS186662,22M1-SBP Id680462,..."
4,"bf677129,05D99-SBP MA755072,24A76-SBP Zf805822..."


In [5]:
product.head()

Unnamed: 0,Product ID,Product Name,Price
0,SBP,"Soap, Bar",£4.55
1,SPL,"Soap, Liquid",£6.50
2,HS,Hand Sanitiser,£2.29


In [5]:
#split each cell on spaces into a list of IDs, then explode() so each ID gets its own row
df = customer_info['IDs'].str.split(' ').explode().to_frame()
df.head()

Unnamed: 0,IDs
0,"Ju856452,13S24-SPL"
0,"wd234175,29M77-SPL"
0,"KZ621372,42K26-SBP"
0,"AY559207,53K50-HS"
1,"jM391563,00C69-SPL"
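
To make the split-and-explode step easier to follow, here is a minimal sketch on a two-row sample. The `sample` frame and the `exploded` name are illustrative assumptions; the ID strings are taken from the preview above.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the 'IDs' column
sample = pd.DataFrame({
    "IDs": ["Ju856452,13S24-SPL wd234175,29M77-SPL", "jM391563,00C69-SPL"]
})

# str.split(' ') turns each cell into a list of IDs;
# explode() then gives every list element its own row.
exploded = (
    sample["IDs"]
    .str.split(" ")
    .explode()
    .reset_index(drop=True)  # optional: replace the repeated source-row index
    .to_frame()
)

print(exploded)
```

Without `reset_index(drop=True)`, the exploded rows keep the index of the row they came from, which is why the preview above shows the index 0 repeated.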


###  2. Each ID field contains the following information we need to extract

In [6]:
#The first 6 digits present in each ID is the customer phone number
df['Customer Phone Number'] = df['IDs'].str.extract(r'(\d{6})')

In [7]:
#The first 2 digits after the ',' are the last 2 digits of the Area Code
#the capture group keeps only the digits, so no second extract is needed
df['Area Code'] = df['IDs'].str.extract(r',(\d{2})', expand=False)

In [8]:
#The letter following this is the first letter of the name of the area that they are calling from
df['Area Name'] = df['IDs'].str.extract(r',\d{2}([A-Z])', expand=False)

In [9]:
#The digits after this letter represent the quantity of products ordered
df['Quantity Ordered'] = df['IDs'].str.extract(r',\d{2}[A-Z](\d+)', expand=False)

In [10]:
# The letters after the '-' are the product ID codes
df['Product ID'] = df['IDs'].str.extract(r'-([A-Z]+)', expand=False)
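
As an alternative sketch, all five fields can be pulled out in one pass with named capture groups. This assumes every ID has exactly six digits immediately before the comma, as in the sample rows; the group names (`phone`, `area_code`, and so on) are illustrative and would still need renaming to the final column names.

```python
import pandas as pd

# Two IDs copied from the preview of the Customer Information file
ids = pd.Series(["Ju856452,13S24-SPL", "JS186662,22M1-SBP"], name="IDs")

# One pattern, one named group per field (assumes 6 digits directly before the comma)
pattern = (
    r"(?P<phone>\d{6}),"        # customer phone number
    r"(?P<area_code>\d{2})"     # last 2 digits of the area code
    r"(?P<area_letter>[A-Z])"   # first letter of the area name
    r"(?P<quantity>\d+)-"       # quantity ordered
    r"(?P<product_id>[A-Z]+)"   # product ID code
)

# With named groups, str.extract returns a DataFrame with one column per group
parts = ids.str.extract(pattern)
parts["quantity"] = parts["quantity"].astype(int)

print(parts)
```

Keeping the phone number and area code as strings preserves any leading zeros, which matters when the area code is later used as a join key.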

### 3. Rename these fields appropriately, and remove any unwanted columns – leaving only these 5 columns in the workflow. 

In [12]:
df.drop('IDs', axis= 'columns', inplace=True)

### 4. Input the Area Code Lookup Table – find a way to join it to the Customer information file 

In [14]:
area_code['2digits'] = area_code['Code'].astype(str).str[-2:] #get the last 2 digits
area_code['Name_letter'] = area_code['Area'].str[0] #get the first letter of the Area Name
final = pd.merge(df, area_code, left_on=['Area Code','Area Name'], right_on=['2digits','Name_letter'])
#final.head()
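
The two-key join can be illustrated on a toy lookup table. In this sketch the `Mockton` row is invented purely to show what happens when two areas share both the last two code digits and the first letter; the `lookup`, `orders`, and `matches` names are also assumptions.

```python
import pandas as pd

# Toy lookup: 'Mockton' (code 1729) is a made-up area that collides with Matlock
lookup = pd.DataFrame({
    "Code": [1629, 1729, 114],
    "Area": ["Matlock", "Mockton", "Sheffield"],
})
lookup["2digits"] = lookup["Code"].astype(str).str[-2:]  # last 2 digits of the code
lookup["Name_letter"] = lookup["Area"].str[0]            # first letter of the area

# Toy customer rows with the fields extracted from the ID strings
orders = pd.DataFrame({
    "Customer Phone Number": ["234175", "856452"],
    "Area Code": ["29", "14"],
    "Area Name": ["M", "S"],
})

# Inner join on both derived keys
joined = orders.merge(
    lookup,
    left_on=["Area Code", "Area Name"],
    right_on=["2digits", "Name_letter"],
    how="inner",
)

# Count how many distinct areas each phone number matched
matches = joined.groupby("Customer Phone Number")["Area"].nunique()
print(matches)
```

A phone number with more than one match is ambiguous: the two digits and the letter are not enough to pin down a single area.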

### 5. We don’t actually sell products in Clevedon, Fakenham, or Stornoway. Exclude these from our dataset 

In [16]:
remove_areas = ['Clevedon','Fakenham','Stornoway']
final = final[~final['Area'].isin(remove_areas)]

In [18]:
final.head()

Unnamed: 0,Customer Phone Number,Area Code,Area Name,Quantity Ordered,Product ID,Code,Area,2digits,Name_letter
16,234175,29,M,77,SPL,1629,Matlock,29,M
17,40676,29,M,89,SBP,1629,Matlock,29,M
18,34528,29,M,41,HS,1629,Matlock,29,M
19,242558,29,M,42,SBP,1629,Matlock,29,M
20,158593,29,M,4,SPL,1629,Matlock,29,M


### 6. In some cases, the ID field does not contain enough information to determine where the customer is from. Exclude any phone numbers where the join has produced duplicated records.

In [15]:
final = final.drop_duplicates(subset=['Customer Phone Number','Area Code','Area Name','Quantity Ordered','Product ID'], keep=False)
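
The important detail in this step is `keep=False`, which discards every copy of a duplicated key rather than keeping the first one. A simplified sketch, using only the phone number as the key and an invented `Mockton` row to create the duplicate:

```python
import pandas as pd

# Toy join result: phone 234175 matched two areas ('Mockton' is made up)
joined = pd.DataFrame({
    "Customer Phone Number": ["234175", "234175", "856452"],
    "Area": ["Matlock", "Mockton", "Sheffield"],
})

# keep=False drops ALL rows whose key appears more than once
unambiguous = joined.drop_duplicates(subset=["Customer Phone Number"], keep=False)

# The default keep='first' would retain one of the ambiguous rows instead
first_kept = joined.drop_duplicates(subset=["Customer Phone Number"])

print(unambiguous)
```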

### 7. Remove any unwanted fields created from the join. 

In [16]:
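# Drop the helper columns brought in by the join. Note that this also drops the full 'Area' name,
# so the steps below group by 'Area Name', which holds only the first letter of the area.
final.drop(columns = ['Code','Area','2digits','Name_letter'], inplace=True)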

### 8. Join this dataset to our product lookup table. 

In [17]:
final = pd.merge(final, product, on='Product ID')

### 9. For each area, and product, find the total sales values, rounded to zero decimal places 

In [18]:
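# Strip the leading '£' from the price string and convert it to a number
final['Amount'] = final['Price'].str[1:].astype(float)
# Quantities were extracted as strings, so convert them before multiplying
final['Quantity Ordered'] = final['Quantity Ordered'].astype(float)
final['Total Sales'] = final['Amount'] * final['Quantity Ordered'] 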

In [19]:
Area_Sales = final.groupby(['Area Name','Product Name'])['Total Sales'].sum()
Area_Sales = Area_Sales.to_frame().reset_index()
Area_Sales['Total Sales'] = Area_Sales['Total Sales'].round(0).astype(int)
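
The price parsing and aggregation can be checked by hand on a few rows. In this sketch the quantities and prices mirror rows shown in the previews above, but the `sales` frame itself is an illustrative assumption, not the real joined dataset.

```python
import pandas as pd

# Small sample: prices as '£' strings, quantities as strings (as extracted by regex)
sales = pd.DataFrame({
    "Area": ["Matlock", "Matlock", "Matlock"],
    "Product Name": ["Soap, Bar", "Soap, Bar", "Hand Sanitiser"],
    "Price": ["£4.55", "£4.55", "£2.29"],
    "Quantity Ordered": ["89", "42", "41"],
})

# Remove the currency symbol, then convert both operands to numbers
sales["Amount"] = sales["Price"].str.lstrip("£").astype(float)
sales["Total Sales"] = sales["Amount"] * sales["Quantity Ordered"].astype(int)

# Sum per area and product, then round to whole pounds
totals = sales.groupby(["Area", "Product Name"], as_index=False)["Total Sales"].sum()
totals["Total Sales"] = totals["Total Sales"].round(0).astype(int)

print(totals)
```

Here the soap bar rows add up to 4.55 × (89 + 42) = 596.05, which rounds to 596, and the hand sanitiser row is 2.29 × 41 = 93.89, which rounds to 94.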

### 10. Rank how well each product sold in each area. 

In [20]:
# Rank only the numeric 'Total Sales' column within each area (1 = highest sales)
Area_Sales['Rank'] = Area_Sales.groupby('Area Name')['Total Sales'].rank(ascending=False).astype(int)
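
A small sketch of ranking within groups, using made-up totals. It passes `method="min"`, which is a choice made here rather than something the cell above does: the default `method="average"` can produce fractional ranks for ties, which `astype(int)` would then truncate.

```python
import pandas as pd

# Made-up totals for two areas
totals = pd.DataFrame({
    "Area": ["A", "A", "A", "B", "B"],
    "Product Name": ["Hand Sanitiser", "Soap, Bar", "Soap, Liquid",
                     "Soap, Bar", "Soap, Liquid"],
    "Total Sales": [94, 596, 800, 300, 120],
})

# Rank within each area; the highest total gets rank 1
totals["Rank"] = (
    totals.groupby("Area")["Total Sales"]
    .rank(ascending=False, method="min")
    .astype(int)
)

print(totals)
```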

### 11. For each area, work out the percent of total that each different product contributes to the overall revenue of that Area, rounded to 2 decimal places

In [21]:
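# transform('sum') repeats each area's total on every row of that area
Area_Sales['Area Total'] = Area_Sales['Total Sales'].groupby(Area_Sales['Area Name']).transform('sum')
# Share of the area's revenue, stored as a fraction (0.33 means 33%) rounded to 2 decimal places
Area_Sales['% of Total Product'] = Area_Sales['Total Sales'] / Area_Sales['Area Total']
Area_Sales['% of Total Product'] = Area_Sales['% of Total Product'].round(2)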
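
If the value is wanted on a 0 to 100 scale instead of as a fraction, one possible variant multiplies by 100 before rounding. The totals below are made up, and the `% of Total` column name is an assumption.

```python
import pandas as pd

# Made-up totals for two areas
totals = pd.DataFrame({
    "Area": ["A", "A", "A", "B", "B"],
    "Product Name": ["Hand Sanitiser", "Soap, Bar", "Soap, Liquid",
                     "Soap, Bar", "Soap, Liquid"],
    "Total Sales": [200, 300, 500, 1, 2],
})

# transform('sum') broadcasts each area's total back onto its rows
totals["Area Total"] = totals.groupby("Area")["Total Sales"].transform("sum")

# Percent of the area's revenue, rounded to 2 decimal places
totals["% of Total"] = (100 * totals["Total Sales"] / totals["Area Total"]).round(2)

print(totals)
```

Rounding after scaling keeps two decimal places of the percentage itself (for example 33.33), whereas rounding the fraction first keeps only whole percentage points.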

### 12. Output the data 

In [22]:
final = Area_Sales[['Rank','Area Name','Product Name', '% of Total Product']]

In [23]:
final.to_csv('WK9-Working with strings Output.csv', index=False)

In [24]:
final.head()

Unnamed: 0,Rank,Area Name,Product Name,% of Total Product
0,3,A,Hand Sanitiser,0.2
1,2,A,"Soap, Bar",0.33
2,1,A,"Soap, Liquid",0.47
3,3,C,Hand Sanitiser,0.17
4,2,C,"Soap, Bar",0.29
