# Asalam Alaikum, hello and welcome to this Jupyter notebook

### This is a practical project from A to Z (from getting the data to visualizing the useful insights)

### I will take you through this project step by step to give you a better understanding of how I did it

#### This project is based on information about books, which I collected from Wikipedia using the link below.
#### The process starts with scraping the Wikipedia page List_of_best-selling_books, which has several tables containing information about different types of books

### So let's get started

#### 1st: we will scrape the data from the web source. We import the necessary libraries: urllib for requesting and getting the HTML data, BeautifulSoup for processing the HTML content, pandas for cleaning and analyzing the data, and numpy for some calculations

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import numpy as np

#### We will send a request to the URL

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_best-selling_books'
page = urlopen(url)

In [None]:
html_bytes = page.read()
html = html_bytes.decode('utf-8')
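
The request and decode steps can also be wrapped in a small helper. This is only a minimal sketch, not the code used in this notebook: `build_request`, `fetch_html`, and the User-Agent string are hypothetical names chosen for illustration. It assumes that sending an explicit User-Agent header is preferable to the urllib default, closes the connection with a context manager, and reads the charset from the response headers, falling back to UTF-8.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier for this example; replace it with your own
USER_AGENT = "books-notebook-example/0.1"

def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies the client with a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=30):
    """Download a page and decode it to text (this performs a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset announced by the server, or UTF-8 if none is given
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)

# Building the request does not touch the network, so it is safe to inspect
request = build_request("https://en.wikipedia.org/wiki/List_of_best-selling_books")
print(request.get_header("User-agent"))
```

Calling `fetch_html(url)` would then return the same kind of string as `html` above.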

In [None]:
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]

In [None]:
title
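
One caveat with slicing by `find` is that `str.find` returns -1 when the tag is missing, so the slice would silently return the wrong text. The sketch below shows one way to guard against that. `extract_between` and the sample strings are made up for illustration.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between start_tag and end_tag, or None if either is missing."""
    start = text.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    # Search for the closing tag only after the opening tag
    end = text.find(end_tag, start)
    if end == -1:
        return None
    return text[start:end]

# Made-up inputs: one with a title, one without
sample_html = "<html><head><title>List of best-selling books</title></head></html>"
found_title = extract_between(sample_html, "<title>", "</title>")
missing_title = extract_between("<p>no title here</p>", "<title>", "</title>")

print(found_title)
print(missing_title)
```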

#### We will create a BeautifulSoup object and parse the HTML content

In [None]:
soup = BeautifulSoup(html, 'html.parser')

In [None]:
soup

In [None]:
soup.title.text

In [None]:
len(soup.select('img'))

In [None]:
soup.title.string

In [None]:
soup.select('tbody')[0]

#### As you can see, we can use the BeautifulSoup object to navigate through the HTML elements easily and get the attributes and content of each element
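
To make the navigation concrete, here is a small sketch on a hand-written HTML snippet (the snippet, `sample_soup`, and the other variable names are made up for illustration and are not taken from the Wikipedia page). It uses `select` and `select_one` with CSS selectors, reads an attribute like a dictionary key, and reads the content with `get_text`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it is not taken from the Wikipedia page
sample_html = """
<table>
  <tr><th>Book</th><th>Author(s)</th></tr>
  <tr><td><a href="/wiki/Example_Book">Example Book</a></td><td>Example Author</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
header_cells = [th.get_text(strip=True) for th in sample_soup.select("th")]

# Attributes are read like dictionary keys; the content comes from get_text()
link = sample_soup.select_one("td a")
link_target = link["href"]
link_text = link.get_text(strip=True)

print(header_cells, link_target, link_text)
```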

#### Now that we have scraped and parsed the data, the problem is that it is very dirty, so we should clean it first. We will extract the data from the HTML tables, combine it, and then remove the \n characters from the whole dataset. We start by cleaning the data and converting it to a pandas DataFrame object

In [None]:
def splitSlashN(i):
    return i.split('\n')[0]

In [None]:
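def getTableHeader(table):
    # Collect the text of every child of the first row: the header cells
    # and the newline strings that sit between them
    headings = [th.text for th in table.tr]

    # Drop the bare newline entries
    headings = [h for h in headings if h != '\n']

    # Keep only the text before the first newline in each heading
    headings = list(map(splitSlashN, headings))

    return headings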
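
The two cleanup steps in the header function can be tried without any HTML. The sketch below runs them on a hand-written list that imitates what iterating over a header row yields (header texts with trailing newlines, separated by bare newline strings). The list `raw` and the function `clean_headings` are assumptions made for this example.

```python
def clean_headings(raw_headings):
    """Drop bare newline entries, then keep the text before the first newline."""
    without_newlines = [h for h in raw_headings if h != "\n"]
    return [h.split("\n")[0] for h in without_newlines]

# Hand-written stand-in for the children of a header row
raw = ["\n", "Book\n", "\n", "Author(s)\n", "\n", "Original language\n"]
cleaned = clean_headings(raw)
print(cleaned)
```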

#### Here is the function that reads a table's data. The data is spread across many tables, so we collect the rows of each table in order to combine them all later

In [None]:
def getTableData(table):
    table_data = []
    table_header = getTableHeader(table)
    # The header is spelled both ways, so check for both
    has_installments = ('No. of installments' in table_header
                        or 'No. of instalments' in table_header)

    for row in table.find_all('tr'):
        # cell.text is always a string, so no None checks are needed below
        row_data = [cell.text for cell in row.find_all('td')]

        # Rows without <td> cells (such as header rows) yield no data
        if len(row_data) > 0:
            data_item = {"Book": row_data[0],
                         "Author(s)": row_data[1],
                         "Original language": row_data[2]}

            # Index of the next unread cell; it shifts by one when the
            # table has an installments column
            i = 3

            if has_installments:
                data_item['No. of installments'] = row_data[i]
                i = i + 1
            else:
                data_item['No. of installments'] = None

            data_item["First published"] = row_data[i]
            i = i + 1
            data_item["Approximate sales"] = row_data[i]
            i = i + 1

            # Only some tables have a Genre column, and a row may lack the cell
            if 'Genre' in table_header and i < len(row_data):
                data_item["Genre"] = row_data[i]
            else:
                data_item['Genre'] = None

            table_data.append(data_item)

    return table_data
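
The column bookkeeping (the installments column shifts the later cells by one, and Genre is optional) can be illustrated on plain lists. The following is an alternative sketch, not the function used above: `row_to_record` is a hypothetical helper, the two headers are simplified stand-ins, and the cell values are shortened versions of rows shown in this notebook's outputs. Unlike the function above, it leaves a field as None when a trailing cell is missing instead of raising an error.

```python
ALL_FIELDS = ["Book", "Author(s)", "Original language", "No. of installments",
              "First published", "Approximate sales", "Genre"]

def row_to_record(cells, header):
    """Map a list of cell texts to named fields, based on the table's header."""
    columns = ["Book", "Author(s)", "Original language"]
    if "No. of installments" in header or "No. of instalments" in header:
        columns.append("No. of installments")
    columns += ["First published", "Approximate sales"]
    if "Genre" in header:
        columns.append("Genre")

    # Start with every field set to None, then fill in the cells that exist;
    # zip() stops at the shorter sequence, so a missing trailing cell stays None
    record = dict.fromkeys(ALL_FIELDS)
    record.update(zip(columns, cells))
    return record

# Simplified stand-in headers for a single-volume table and a series table
single_header = ["Book", "Author(s)", "Original language",
                 "First published", "Approximate sales", "Genre"]
series_header = ["Book series", "Author(s)", "Original language",
                 "No. of installments", "First published", "Approximate sales"]

single = row_to_record(["A Tale of Two Cities", "Charles Dickens", "English",
                        "1859", "200 million", "Historical fiction"], single_header)
series = row_to_record(["Vampire Hunter D", "Hideyuki Kikuchi", "Japanese",
                        "39", "1983–present", "17 million"], series_header)

print(single)
print(series)
```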

In [None]:
book_tables = soup.find_all('tbody')
# Discard the last <tbody> so that only the book tables remain
book_tables.pop()
len(book_tables)

#### We will call the getTableData function to get the data from every HTML table

In [None]:
all_books = []

for table in book_tables:
    all_books.append(getTableData(table))

In [None]:
all_books[7]

#### We will convert the Python list (which is populated with all of the data from the HTML tables) into a pandas DataFrame

In [370]:
# Build one DataFrame per table and concatenate them in a single call
books_df = pd.concat([pd.DataFrame(table) for table in all_books], ignore_index=True)
    
books_df

Unnamed: 0,Book,Author(s),Original language,No. of installments,First published,Approximate sales,Genre
0,A Tale of Two Cities,Charles Dickens,English,,1859,200 million[19][circular reporting?]\n,Historical fiction\n
1,The Little Prince (Le Petit Prince),Antoine de Saint-Exupéry,French,,1943,200 million[20][21]\n,Novella\n
2,Harry Potter and the Philosopher's Stone,J. K. Rowling,English,,1997,120 million[22][23]\n,Fantasy\n
3,And Then There Were None,Agatha Christie,English,,1939,100 million[24]\n,Mystery\n
4,Dream of the Red Chamber (紅樓夢),Cao Xueqin,Chinese,,1791,100 million[25][26]\n,Family saga\n
...,...,...,...,...,...,...,...
323,"旺文社古語辞典 (Obunsha Kogo Jiten) ""Obunsha Dictiona...",Akira Matsumura,Japanese,,1960,11 million[352]\n,
324,Hammond's Pocket Atlas\n,\n,English\n,,(Up to 1965)\n,11 million[353]\n,
325,"三省堂国語辞典 (Sanseido Kokugo Jiten) ""Sanseido Dict...",Kenbō Hidetoshi,Japanese,,1960,10 million[354]\n,
326,家庭に於ける実際的看護の秘訣 (Katei Ni Okeru Jissaiteki Kang...,Takichi Tsukuda,Japanese,,1925,10 million[355]\n,


#### We will remove the \n characters, as they are visible all over the dataset

In [371]:
def cleanColumnsSlashN(value):
    # Keep only the text before the first newline; pass None through unchanged
    if value is not None:
        return value.split('\n')[0]

    return None

# Apply the same cleanup to every text column
text_columns = ['Book', 'Author(s)', 'Original language', 'No. of installments',
                'First published', 'Approximate sales', 'Genre']
for column in text_columns:
    books_df[column] = books_df[column].apply(cleanColumnsSlashN)

#### We will now create some helper functions that remove the extra characters and keep only the number in the Approximate sales column, so that we can work with numbers easily in the analysis

In [372]:
def getSaleAmount(sale):
    # Return the first of the leading three words that starts with a digit
    # (the first word may also start with '>')
    if sale is not None:
        s = sale.split(' ')
        if s[0][0].isdigit() or s[0][0] == '>':
            return s[0]
        elif s[1][0].isdigit():
            return s[1]
        elif s[2][0].isdigit():
            return s[2]

    return None

def removeExtraCharacters(sale):
    # Strip a leading '>' by keeping what follows it
    if sale is not None:
        s = sale.split('>')
        if len(s) > 1:
            return s[1]
        return sale
    return None

def removeExtraBrackets(sale):
    # Drop reference markers by keeping the text before the first '['
    if sale is not None:
        return sale.split('[')[0]
    return None

def removeDashFromSale(sale):
    # Replace a range written with an en dash by the midpoint of its two ends
    if sale is not None:
        s = sale.split('–')
        if len(s) > 1:
            low, high = pd.to_numeric(s[0]), pd.to_numeric(s[1])
            return (low + high) / 2
        return sale
    return None

In [373]:
books_df['Approximate sales'] = books_df['Approximate sales'].apply(lambda x: getSaleAmount(x)) 

In [374]:
books_df['Approximate sales'] = books_df['Approximate sales'].apply(lambda x: removeExtraCharacters(x)) 

In [375]:
books_df['Approximate sales'] = books_df['Approximate sales'].apply(lambda x: removeExtraBrackets(x)) 

In [376]:
books_df['Approximate sales'] = books_df['Approximate sales'].apply(lambda x: removeDashFromSale(x)) 

In [377]:
books_df['Approximate sales'] = pd.to_numeric(books_df['Approximate sales'])
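
The four cleanup passes can also be expressed as one regular-expression function. This is an alternative sketch, not the approach used above: `parse_sales` and `NUMBER_RANGE` are hypothetical names, the first test string is taken from the table output earlier in this notebook, and the other strings are made up in the formats the helper functions handle. It assumes the figures contain no thousands separators.

```python
import re

# One number (with optional decimals), optionally followed by a dash and a second number
NUMBER_RANGE = re.compile(r"(\d+(?:\.\d+)?)(?:\s*[–-]\s*(\d+(?:\.\d+)?))?")

def parse_sales(text):
    """Return the sales figure as a float (the midpoint for ranges), or None."""
    if text is None:
        return None
    # Drop reference markers such as [19] or [circular reporting?]
    without_refs = re.sub(r"\[[^\]]*\]", "", text)
    match = NUMBER_RANGE.search(without_refs)
    if match is None:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    return (low + high) / 2

examples = [
    "200 million[19][circular reporting?]",
    ">100 million",
    "100–150 million[300]",
    None,
]
parsed_sales = [parse_sales(e) for e in examples]
print(parsed_sales)
```

A function like this could be applied to the raw column in a single `apply` call, at the cost of being less explicit about each cleanup step.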

#### Now we have removed the extra characters from the Approximate sales column and converted it to numeric

In [378]:
books_df.loc[296]

Book                     Scouting for Boys
Author(s)              Robert Baden-Powell
Original language                  English
No. of installments                   None
First published                       1908
Approximate sales                    125.0
Genre                                 None
Name: 296, dtype: object

In [None]:
books_df.loc[300:340]

In [None]:
books_df.describe()

#### We will create a new column for the book type, because every book belongs to a specific type, as you can see on the website itself

In [379]:
books_df['Book type'] = None

In [380]:
books_df.loc[0:172, 'Book type'] = 'Individual'

In [382]:
books_df.loc[173:294, 'Book type'] = 'Series'

In [384]:
books_df.loc[295:, 'Book type'] = 'Regularly Updated'


In [361]:
books_df.sample(10)

Unnamed: 0,Book,Author(s),Original language,No. of installments,First published,Approximate sales,Genre,Book type
267,Découvertes Gallimard,Various authors,French,more than 700,1986–present,20.0,,Series
211,Anpanman (アンパンマン),Takashi Yanase,Japanese,150 picture books,1973–2013,80.0,,Series
101,The Secret,Rhonda Byrne,English,,2006,20.0,Self-help,Individual
285,吸血鬼ハンターD (Vampire Hunter D),Hideyuki Kikuchi,Japanese,39,1983–present,17.0,,Series
129,A Wrinkle in Time,Madeleine L'Engle,English,,1962,14.0,,Individual
312,自由自在 (Jiyu Jizai),Various authors,Japanese,,1953–present,24.0,,Regularly Updated
68,The Wind in the Willows,Kenneth Grahame,English,,1908,25.0,Children's literature,Individual
319,Merriam-Webster Pocket Dictionary,,English,,(Up to 1965),15.11,,Regularly Updated
125,The Hitchhiker's Guide to the Galaxy,Douglas Adams,English,,1979,14.0,Science fiction,Individual
172,Bridget Jones's Diary,Helen Fielding,English,,1996,10.0,,Individual
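
Assigning a value to a range of rows in one column is most reliably done with a single `.loc[rows, column]` call. Chaining two indexing steps (`df[column].loc[rows] = ...`) can trigger pandas' SettingWithCopyWarning and may not update the original frame. The sketch below shows the single-call form on a tiny made-up frame. `demo_df`, its titles, and the row boundaries are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in frame; the titles and boundaries are made up for illustration
demo_df = pd.DataFrame({"Book": ["A", "B", "C", "D", "E"]})
demo_df["Book type"] = None

# .loc slices by label and includes both endpoints
demo_df.loc[0:1, "Book type"] = "Individual"
demo_df.loc[2:3, "Book type"] = "Series"
demo_df.loc[4:, "Book type"] = "Regularly Updated"

book_types = demo_df["Book type"].tolist()
print(book_types)
```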


#### It's time to work on the date column. We will create one more column (Last published) that stores the second half of First published for future use, and keep only the first half in First published so that it is easier to convert to a date

In [385]:
def getLastPublishing(date):
    if (date is not None):
        d = date.split('–')
        if (len(d) > 1):
            return d[1]
        
    return None

In [386]:
books_df['Last published'] = books_df['First published'].apply(lambda x: getLastPublishing(x))

In [387]:
def getFirstPublishing(date):
    if (date is not None):
        d = date.split('–')
        return d[0]
        
    return None

In [388]:
books_df['First published'] = books_df['First published'].apply(lambda x: getFirstPublishing(x))
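
Both halves can also be obtained in one step with `str.partition`. This is a minimal sketch with a hypothetical helper name, `split_publication_range`. The two inputs are values that appear in this notebook's outputs, and the separator is the en dash (–) seen in those values, not a hyphen.

```python
def split_publication_range(text):
    """Split '1986–present' into ('1986', 'present'); a single year gives (year, None)."""
    if text is None:
        return None, None
    # partition() returns the text before, the separator itself, and the text after
    first, dash, last = text.partition("–")
    return first, (last if dash else None)

ranged = split_publication_range("1986–present")
single_year = split_publication_range("1859")

print(ranged)
print(single_year)
```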

In [397]:
books_df.loc[319, 'First published'] = '1965'
books_df.loc[324, 'First published'] = '1965'


In [398]:
books_df[books_df['First published'].str.contains('Up to')]

Unnamed: 0,Book,Author(s),Original language,No. of installments,First published,Approximate sales,Genre,Book type,Last published


In [402]:
books_df.loc[303, 'First published'] = books_df.loc[303, 'First published'].split('(')[0]


In [409]:
# Note: with errors='ignore', the column is returned unchanged if any value cannot be parsed
books_df['First published'] = pd.to_datetime(books_df['First published'], errors='ignore')
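
Because the cleaned column holds plain year strings, another option is to state the format explicitly and turn values that cannot be parsed into `NaT`. This is a sketch of that alternative, not the notebook's approach: the small Series `years` is made up, and whether `NaT` is acceptable for the later analysis is an assumption.

```python
import pandas as pd

# Small stand-in for the cleaned column; 'unknown' represents a non-year value
years = pd.Series(["1988", "1852", "unknown"])

# format='%Y' parses plain years; errors='coerce' turns failures into NaT
parsed = pd.to_datetime(years, format="%Y", errors="coerce")

print(parsed)
```

With this form the column gets a datetime dtype, so `parsed.dt.year` can be used for grouping by year.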

In [413]:
books_df.sample(20)

Unnamed: 0,Book,Author(s),Original language,No. of installments,First published,Approximate sales,Genre,Book type,Last published
107,Matilda,Roald Dahl,English,,1988,17.0,Children's Literature,Individual,
276,Rainbow Magic,Daisy Meadows,English,80+,2003,20.0,,Series,present
171,The Story of My Experiments with Truth (સત્યના...,Mohandas Karamchand Gandhi,Gujarati,,1925,10.0,,Individual,1929
305,Roget's Thesaurus,Peter Mark Roget,English,,1852,40.0,,Regularly Updated,
161,The Dukan Diet,Pierre Dukan,French,,2000,10.0,,Individual,
99,Where the Wild Things Are,Maurice Sendak,English,,1963,20.0,Children's picture book,Individual,
77,Love Story,Erich Segal,English,,1970,21.0,Romance novel,Individual,
131,The Old Man and the Sea,Ernest Hemingway,English,,1952,13.0,,Individual,
71,The Celestine Prophecy,James Redfield,English,,1993,23.0,New-age spiritual novel,Individual,
155,Night (Un di Velt Hot Geshvign),Elie Wiesel,Yiddish,,1958,10.0,,Individual,
