# Mobile Phones

## Import and cleanup

Let's start by importing the modules we need and loading the `phones` table.

In [2]:
import pandas as pd
import numpy as np
from datetime import date
from datetime import datetime
from datetime import timedelta
from dateutil.relativedelta import relativedelta
from math import nan
import re


In [3]:
# import from CSV
phones = pd.read_csv("phone_dataset.csv")

# quick view of what the table contains
print('---------- Quick view of the phones dataset ----------')
print('Rows, columns:', phones.shape, '\n')
print(phones.head(15), '\n')


ParserError: Error tokenizing data. C error: Expected 40 fields in line 821, saw 41


We encountered this error while trying to import the data: 

> `ParserError: Error tokenizing data. C error: Expected 40 fields in line 821, saw 41`

This is due to errors introduced by the web scraping process: line 821 has 41 fields instead of the expected 40. Other rows may have the same problem, so we'll have pandas warn about malformed rows and skip them.

In [4]:
# import from CSV, warn of bad lines and then skip them
phones = pd.read_csv("phone_dataset.csv", on_bad_lines='warn')

# quick view of what the table contains
print('---------- Quick view of the phones dataset ----------')
print('Rows, columns:', phones.shape, '\n')
print(phones.head(15), '\n')


---------- Quick view of the phones dataset ----------
Rows, columns: (8628, 40) 

   brand                 model        network_technology  \
0   Acer         Iconia Talk S          GSM / HSPA / LTE   
1   Acer        Liquid Z6 Plus          GSM / HSPA / LTE   
2   Acer             Liquid Z6          GSM / HSPA / LTE   
3   Acer  Iconia Tab 10 A3-A40  No cellular connectivity   
4   Acer             Liquid X2          GSM / HSPA / LTE   
5   Acer         Liquid Jade 2          GSM / HSPA / LTE   
6   Acer      Liquid Zest Plus          GSM / HSPA / LTE   
7   Acer           Liquid Zest          GSM / HSPA / LTE   
8   Acer            Predator 8  No cellular connectivity   
9   Acer     Liquid Jade Primo          GSM / HSPA / LTE   
10  Acer           Liquid Z330          GSM / HSPA / LTE   
11  Acer           Liquid Z320                GSM / HSPA   
12  Acer          Liquid Z630S          GSM / HSPA / LTE   
13  Acer           Liquid Z630          GSM / HSPA / LTE   
14  Acer         

Skipping line 821: expected 40 fields, saw 41
Skipping line 6060: expected 40 fields, saw 41
Skipping line 6663: expected 40 fields, saw 41



There are only three problematic rows, so we can skip them for now.
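
Before discarding rows it can help to look at what they contain. The following is a minimal sketch, assuming pandas 1.4 or later, where `on_bad_lines` accepts a callable when `engine="python"` is used. The three-column CSV text and the `collect_bad_line` helper are hypothetical stand-ins, not part of the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for phone_dataset.csv
sample_csv = io.StringIO(
    "brand,model,status\n"
    "Acer,Liquid Z6,Available\n"
    "Acer,Liquid X2,Available,extra\n"
    "Acer,Predator 8,Discontinued\n"
)

bad_rows = []

def collect_bad_line(fields):
    # pandas passes the already-split fields of each malformed row;
    # returning None tells pandas to skip that row
    bad_rows.append(fields)
    return None

# a callable for on_bad_lines is only supported by the python engine
sample = pd.read_csv(sample_csv, on_bad_lines=collect_bad_line, engine="python")

print('Rows, columns:', sample.shape)
print('Skipped rows:', bad_rows)
```

Collecting the skipped rows this way would make it possible to repair them by hand later instead of losing them silently.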

In [5]:
# quick view of what columns we have
print(phones.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8628 entries, 0 to 8627
Data columns (total 40 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   brand               8628 non-null   object 
 1   model               8628 non-null   object 
 2   network_technology  8628 non-null   object 
 3   2G_bands            8628 non-null   object 
 4   3G_bands            4856 non-null   object 
 5   4G_bands            1603 non-null   object 
 6   network_speed       4883 non-null   object 
 7   GPRS                8596 non-null   object 
 8   EDGE                8605 non-null   object 
 9   announced           8613 non-null   object 
 10  status              8628 non-null   object 
 11  dimentions          8609 non-null   object 
 12  weight_g            7679 non-null   object 
 13  weight_oz           7679 non-null   object 
 14  SIM                 8627 non-null   object 
 15  display_type        8624 non-null   object 
 16  displa

We are only interested in phones that actually made it to the market, not ones that are cancelled, abandoned, coming soon, or something similar. In order to filter out phones like these, we need to find out all available kinds of `status`.
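
One way to list the kinds of `status` is to keep only the leading keyword of each value and count the results. This is a rough sketch under the assumption that raw values start with one of four keywords and may carry trailing release details. The sample strings below are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw status strings; the real column may look different
status = pd.Series([
    "Available. Released 2016  October",
    "Discontinued",
    "Coming soon. Exp. release 2017  Q1",
    "Cancelled",
    "Available. Released 2015  June",
])

# keep only the leading status keyword and drop any trailing details
status_kind = status.str.extract(
    r"^(Available|Discontinued|Coming soon|Cancelled)", expand=False
)

# count how many phones fall under each kind of status
counts = status_kind.value_counts()
print(counts)
```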

In [6]:
# refine 'status' column
# for index, value in phones['status'].items():
#     if 'Available' in value:
#         phones.at[index, 'status'] = 'Available'
#     elif 'Coming soon' in value:
#         phones.at[index, 'status'] = 'Coming soon'

# retrieve unique values of 'status'
# print('---------- Phone statuses ----------')
# print(phones['status'].unique().tolist(), '\n')
# ['Available', 'Discontinued', 'Coming soon', 'Cancelled']


# only include Available and Discontinued phones
excluded = phones['status'].str.contains('Cancelled|Coming soon')
phones = phones[~excluded].copy()

print('---------- Phones that are or were available in the market ----------')
print('Rows, columns:', phones.shape, '\n')
print(phones.head(15), '\n')


---------- Phones that are or were available in the market ----------
Rows, columns: (8380, 40) 

   brand                 model        network_technology  \
0   Acer         Iconia Talk S          GSM / HSPA / LTE   
1   Acer        Liquid Z6 Plus          GSM / HSPA / LTE   
2   Acer             Liquid Z6          GSM / HSPA / LTE   
3   Acer  Iconia Tab 10 A3-A40  No cellular connectivity   
4   Acer             Liquid X2          GSM / HSPA / LTE   
5   Acer         Liquid Jade 2          GSM / HSPA / LTE   
6   Acer      Liquid Zest Plus          GSM / HSPA / LTE   
7   Acer           Liquid Zest          GSM / HSPA / LTE   
8   Acer            Predator 8  No cellular connectivity   
9   Acer     Liquid Jade Primo          GSM / HSPA / LTE   
10  Acer           Liquid Z330          GSM / HSPA / LTE   
11  Acer           Liquid Z320                GSM / HSPA   
12  Acer          Liquid Z630S          GSM / HSPA / LTE   
13  Acer           Liquid Z630          GSM / HSPA / LTE   
14

We have a column called `announced`, which holds the month and year in which a phone was announced. However, some phones only list the quarter of announcement, not the exact month. To handle these, we'll make the following column changes:
* Convert `announced` to a date for phones with a known month and year of announcement. The day will always be "01". The value will be null for phones whose exact announcement month is unknown.
* Create `announced_year` and `announced_quarter` containing the year and quarter of announcement, respectively.

_Note:_ The `announced` values are messier than I expected, so cleaning this up will take some time. For now, I got bored and moved on to other datasets.

In [23]:
# normalize months and quarters
month_to_num = {
    'January'   : '01',
    'February'  : '02',
    'March'     : '03',
    'April'     : '04',
    'May'       : '05',
    'June'      : '06',
    'July'      : '07',
    'August'    : '08',
    'September' : '09',
    'October'   : '10',
    'November'  : '11',
    'December'  : '12'
}

get_quarters = {
    '1Q'            : 'Q1',
    '2Q'            : 'Q2',
    '3Q'            : 'Q3',
    '4Q'            : 'Q4',
    'Q1'            : 'Q1',
    'Q2'            : 'Q2',
    'Q3'            : 'Q3',
    'Q4'            : 'Q4',
    'January'       : 'Q1',
    'February'      : 'Q1',
    'March'         : 'Q1',
    'April'         : 'Q2',
    'May'           : 'Q2',
    'June'          : 'Q2',
    'July'          : 'Q3',
    'August'        : 'Q3',
    'September'     : 'Q3',
    'October'       : 'Q4',
    'November'      : 'Q4',
    'December'      : 'Q4'
}


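# these functions extract the year, month, and quarter from the original 'announced' column
# note: the patterns are anchored at the start of the string, so a value that does not
# begin with a four-digit year (e.g. 'Exp. announcement 2012  August') yields nan
def get_year_announced(x):
    # already-converted values are turned back into 'YYYY-MM-DD...' strings
    if isinstance(x, pd.Timestamp):
        x = x.isoformat()
    try:
        year = re.search(r'(^\d{4})', x)
        year = int(year.group(1))
    except (TypeError, AttributeError):
        # non-string input or no match
        return nan
    else:
        return year

def get_month_announced(x):
    try:
        month = re.search(r'^\d{4}\s*(\w*)', x)
        month = month_to_num[month.group(1)]
    except (TypeError, AttributeError, KeyError):
        # non-string input, no match, or the word after the year is not a month
        return nan
    else:
        return month

def get_quarter_announced(x):
    try:
        quarter = re.search(r'^\d{4}\s*(\w*)', x)
        quarter = get_quarters[quarter.group(1)]
    except (TypeError, AttributeError, KeyError):
        # non-string input, no match, or the word after the year is not a month or quarter
        return nan
    else:
        return quarter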


# quick tests to see if the functions are working as expected
print(get_year_announced('Exp. announcement 2012  August'))
print(get_month_announced('2009  March. Released 2009  May'))
print(get_quarter_announced('2009  March. Released 2009  May'))



# create 'announced_year' column
phones['announced_year'] = phones['announced'].apply(get_year_announced)


# create 'announced_quarter' column
phones['announced_quarter'] = phones['announced'].apply(get_quarter_announced)


# convert 'announced' column to dates (first day of the month)
# values with only a quarter, or that cannot be parsed, become NaT (null)
def to_announced_date(value):
    if isinstance(value, pd.Timestamp):
        return value
    if not isinstance(value, str) or 'Q' in value:
        return pd.NaT
    try:
        return pd.to_datetime(f'{get_year_announced(value)}-{get_month_announced(value)}-01', format='%Y-%m-%d')
    except (ValueError, TypeError):
        return pd.NaT

phones['announced'] = phones['announced'].apply(to_announced_date)


# quick view of what the 'announced' column values look like
# print(phones['announced'].unique().tolist(), '\n')

# quick view of what the table contains
print('---------- Quick view of the phones dataset ----------')
print(phones.head(15), '\n')


nan
03
Q1
---------- Quick view of the phones dataset ----------
   brand                 model        network_technology  \
0   Acer         Iconia Talk S          GSM / HSPA / LTE   
1   Acer        Liquid Z6 Plus          GSM / HSPA / LTE   
2   Acer             Liquid Z6          GSM / HSPA / LTE   
3   Acer  Iconia Tab 10 A3-A40  No cellular connectivity   
4   Acer             Liquid X2          GSM / HSPA / LTE   
5   Acer         Liquid Jade 2          GSM / HSPA / LTE   
6   Acer      Liquid Zest Plus          GSM / HSPA / LTE   
7   Acer           Liquid Zest          GSM / HSPA / LTE   
8   Acer            Predator 8  No cellular connectivity   
9   Acer     Liquid Jade Primo          GSM / HSPA / LTE   
10  Acer           Liquid Z330          GSM / HSPA / LTE   
11  Acer           Liquid Z320                GSM / HSPA   
12  Acer          Liquid Z630S          GSM / HSPA / LTE   
13  Acer           Liquid Z630          GSM / HSPA / LTE   
14  Acer          Liquid Z530S     
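
The year, month, and quarter extraction can also be sketched as a single vectorized pass with `Series.str.extract`. This is only an illustration under assumptions: the sample values are hypothetical, the names `MONTHS`, `month_to_int`, and `parts` are made up for this sketch, and the pattern here is deliberately not anchored at the start of the string, so it also picks up a year that follows a prefix such as "Exp. announcement".

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
month_to_int = {name: i for i, name in enumerate(MONTHS, start=1)}

# Hypothetical sample of raw 'announced' values
announced = pd.Series([
    "2016  October",
    "2009  March. Released 2009  May",
    "2015  Q3",
    "2004  1Q",
    "Exp. announcement 2012  August",
    None,
])

# first four-digit year and the word that follows it (a month or a quarter)
parts = announced.str.extract(r"(?P<year>\d{4})\s*(?P<token>\w+)?")

year = pd.to_numeric(parts["year"])
month = parts["token"].map(month_to_int)

# quarter: read it from tokens like 'Q3' or '1Q', else derive it from the month
quarter_digit = parts["token"].str.extract(r"^Q?([1-4])Q?$", expand=False)
quarter = pd.to_numeric(quarter_digit).fillna((month - 1) // 3 + 1)

# date with the day fixed to 01; rows without a month name become NaT
announced_date = pd.to_datetime(
    parts["year"] + " " + parts["token"], format="%Y %B", errors="coerce"
)

print(pd.DataFrame({"year": year, "month": month,
                    "quarter": quarter, "date": announced_date}))
```

Working on whole columns avoids the row-by-row loop, at the cost of a pattern that is harder to read than three small functions.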

## The rise and fall of the audio jack

_To be continued after cleaning up the dataset..._