# Data Wrangling

Data rarely arrives in the format we expect, so it almost always has to be cleaned and transformed before it can be used.

Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format to support better decision making. As the volume of data and the number of data sources grow rapidly, organizing the available data for analysis becomes increasingly essential.

Data wrangling involves processing data that arrives in various formats, analyzing it, and combining it with other data sets to produce valuable insights. It can further include data aggregation, data visualization, and training statistical models for prediction.
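
As a rough illustration of the clean, structure, enrich, and aggregate steps described above, here is a minimal sketch in plain Python. The records, the `region_by_city` lookup table, and the helper names `to_number` and `wrangle` are all made up for this example; they are not part of the notebook.

```python
# Hypothetical raw records, as they might arrive from a messy source.
raw_rows = [
    {"name": "  Alice ", "city": "pune", "sales": "1,200"},
    {"name": "Bob", "city": "Delhi ", "sales": "n/a"},
]
# Hypothetical lookup table used to enrich each row with a region.
region_by_city = {"Pune": "West", "Delhi": "North"}


def to_number(text):
    """Clean: turn '1,200' into 1200.0 and anything unparseable into None."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def wrangle(rows, regions):
    tidy = []
    for row in rows:
        city = row["city"].strip().title()           # clean
        tidy.append({
            "name": row["name"].strip(),             # clean
            "city": city,
            "sales": to_number(row["sales"]),        # structure: one numeric type
            "region": regions.get(city, "Unknown"),  # enrich from the lookup table
        })
    return tidy


tidy_rows = wrangle(raw_rows, region_by_city)

# Aggregate: total sales per region, skipping rows with missing sales.
totals = {}
for row in tidy_rows:
    if row["sales"] is not None:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["sales"]

print(tidy_rows)
print(totals)
```

In practice the same steps would usually be written with pandas, which the notebook imports below; plain dictionaries are used here only to keep the sketch dependency-free.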



In [3]:
##Super-fast and clean conversions to numbers.
!pip install fastnumbers

Collecting fastnumbers
  Downloading fastnumbers-3.2.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (82 kB)
Installing collected packages: fastnumbers
Successfully installed fastnumbers-3.2.1


In [4]:
##Allows Python code to "fix" invalid (X)HTML markup
!pip install pytidylib

Collecting pytidylib
  Downloading pytidylib-0.3.2.tar.gz (87 kB)
Building wheels for collected packages: pytidylib
  Building wheel for pytidylib (setup.py) ... done
  Created wheel for pytidylib: filename=pytidylib-0.3.2-py3-none-any.whl size=8564 sha256=a21cd9a7f9545e96351d0330a16baf44bfc927bdc6f7bbeeeb02c968ffbfc53f
  Stored in directory: /root/.

# Import required libraries

In [6]:
import numpy as np # linear algebra
import pandas as pd # pandas for dataframe based data processing and CSV file I/O
import requests # for http requests
from bs4 import BeautifulSoup # for html parsing and scraping
import bs4
from fastnumbers import isfloat 
from fastnumbers import fast_float
from multiprocessing.dummy import Pool as ThreadPool 

import matplotlib.pyplot as plt
import seaborn as sns
import json
from tidylib import tidy_document # for tidying incorrect html

sns.set_style('whitegrid')
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [7]:
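#Convert string data to numbers using the fast_float function from the fastnumbers library
def ffloat(string):
    # Missing values become NaN
    if string is None:
        return np.nan
    # Values that are already numeric (Python or NumPy) are returned unchanged
    if isinstance(string, (int, float, np.integer, np.floating)):
        return string
    # Keep the first space-separated token, strip thousands separators and
    # percent signs, and fall back to NaN if the result is not a valid number
    return fast_float(string.split(" ")[0].replace(',','').replace('%',''),
                      default=np.nan)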

def ffloat_list(string_list):
    return list(map(ffloat,string_list))

def remove_multiple_spaces(string):
    if type(string)==str:
        return ' '.join(string.split())
    return string
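
To show what these helpers are meant to do with typical scraped strings, here is a minimal sketch that needs only the standard library. `to_float` and `collapse_spaces` are hypothetical stand-ins written for this example: `to_float` uses the built-in `float()` in place of `fast_float`, so edge cases may differ from the fastnumbers version. The sample strings are invented.

```python
import math


def to_float(value):
    """Stand-in for ffloat built on float(); not the fastnumbers implementation."""
    if value is None:
        return math.nan
    if isinstance(value, (int, float)):
        return value
    # Keep the first token and drop thousands separators and percent signs.
    token = value.split(" ")[0].replace(",", "").replace("%", "")
    try:
        return float(token)
    except ValueError:
        return math.nan


def collapse_spaces(value):
    """Same idea as remove_multiple_spaces: collapse runs of whitespace in strings."""
    if isinstance(value, str):
        return " ".join(value.split())
    return value


# Invented sample inputs of the kind found on financial pages.
samples = ["1,234.50", "36.4% YoY", "3210 cr", None, 7, "n/a"]
converted = [to_float(s) for s in samples]
cleaned = collapse_spaces("Hero   MotoCorp \n Ltd")

print(converted)
print(cleaned)
```

Returning NaN instead of raising keeps a long scrape running when a single cell is malformed, at the cost of having to check for missing values afterwards.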

# Sample HTTP GET request

In [8]:
##Making HTTP Requests in Python

In [9]:
response = requests.get("http://www.example.com/", timeout=240)
response.status_code
response.content

200

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    

In [10]:
url = "https://jsonplaceholder.typicode.com/posts/1"
response = requests.get(url, timeout=240)
response.status_code
response.json()

content = response.json()
content.keys()

200

{'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto',
 'id': 1,
 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
 'userId': 1}

dict_keys(['userId', 'id', 'title', 'body'])
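
Once a JSON response has been decoded into a dictionary, the usual next step is to pick out the fields of interest. The sketch below works offline: the payload is copied from the response shown above and decoded with the standard `json` module, which should give a result comparable to `response.json()`. The `summarize` helper is hypothetical and exists only for this example.

```python
import json

# Payload copied from the response shown above, so no network access is needed.
payload = (
    '{"userId": 1, "id": 1, '
    '"title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit", '
    '"body": "quia et suscipit\\nsuscipit recusandae consequuntur expedita et cum\\n'
    'reprehenderit molestiae ut ut quas totam\\nnostrum rerum est autem sunt rem eveniet architecto"}'
)

post = json.loads(payload)


def summarize(post_dict):
    """Hypothetical helper: keep a few fields and tolerate missing keys."""
    body = post_dict.get("body", "")
    return {
        "id": post_dict.get("id"),
        "title": post_dict.get("title", ""),
        "body_lines": len(body.splitlines()),
    }


summary = summarize(post)
print(list(post.keys()))
print(summary)
```

Using `dict.get` with a default avoids a `KeyError` when an API omits a field, which matters when the same code is applied to many records.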

In [11]:
##Scrape Data by Parsing and Traversing HTML
from IPython.display import HTML
HTML("<b>Rendered HTML</b>")


# Send a web request and parse the response with BeautifulSoup

In [12]:
response = requests.get("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM", timeout=240)
page_content = BeautifulSoup(response.content, "html.parser")
HTML(str(page_content.find("h1")))

content = page_content  # reuse the parsed page rather than parsing the response a second time
price_div = content.find("div",attrs={"id":'sp_yearlylow'})
HTML(str(price_div))
price_div = content.find("div",attrs={"id":'sp_yearlyhigh'})
HTML(str(price_div))
price_div = content.find("div",attrs={"id":'sp_low'})
print('week low:')
HTML(str(price_div))
price_div = content.find("div",attrs={"id":'sp_high'})
print('week high:')
HTML(str(price_div))


import re
# Match any alt text that mentions "Hero". Pass flags=re.IGNORECASE to re.compile
# for a case-insensitive match. Note that a pattern wrapped in square brackets
# would define a character class, not a phrase.
regex = re.compile(r'Hero.*')

# Print the alt text of every image whose description mentions Hero
for img in content.find_all('img'):
    imgText = img.get('alt')  # None when the tag has no alt attribute
    if imgText and regex.search(imgText):
        print(imgText)

week low:


week high:


Buy Hero MotoCorp: target of Rs 3210: Sharekhan
Hero MotoCorp Q4 PAT may dip 36.4% YoY to Rs. 550 cr: ICICI Direct
I-T department detects multiple irregularities after raids on Hero Motocorp, others
Hero MotoCorp: An opportunity in tough times
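
The filtering step above can be tried without fetching the page. The sketch below applies a case-sensitive and a case-insensitive pattern to a small list of alt texts. Two entries are copied from the output above; the rest, including the `None` that stands for an image without an `alt` attribute, are invented for illustration, and `matching` is a hypothetical helper.

```python
import re

# Two alt texts from the output above plus invented entries for contrast.
alt_texts = [
    "Buy Hero MotoCorp: target of Rs 3210: Sharekhan",
    None,  # stands for an <img> tag with no alt attribute
    "moneycontrol logo",
    "I-T department detects multiple irregularities after raids on Hero Motocorp, others",
    "hero motocorp: lower-case mention",
]

case_sensitive = re.compile(r"Hero")
case_insensitive = re.compile(r"hero\s+motocorp", flags=re.IGNORECASE)


def matching(texts, pattern):
    """Return the non-empty texts in which the pattern occurs anywhere."""
    return [t for t in texts if t and pattern.search(t)]


strict = matching(alt_texts, case_sensitive)
loose = matching(alt_texts, case_insensitive)

print(strict)
print(loose)
```

`pattern.search` stops at the first match, so it is a natural fit for a yes/no test, and checking `t` first skips missing alt texts without converting them to strings.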
