### Problem Statement: 
Scrape a website and automate the extraction and processing of baby names according to numerology rules, to find the names that are valid for a given birth number.

Skills expected to be learnt:
> 1) Better understanding of lists and dictionaries, along with their associated methods

> 2) Better understanding of for and while loops along with conditional statements

> 3) Basic understanding of libraries like requests, bs4, re

> 4) Web scraping basics, text manipulation basics, basic file I/O


In [7]:
#importing required libraries

#Requests will allow you to send HTTP/1.1 requests using Python
import requests

#Beautiful Soup helps us to pull data out of HTML data
from bs4 import BeautifulSoup

#re (short for "regular expression") is a powerful text matching and manipulation library
import re

In [8]:
#website that has the baby names
URL = 'https://www.babycenter.in/a25010193/modern-indian-baby-names'

#why have we defined HTTP headers? Some websites collect information about the clients that access them; that information can come from cookies or, before any cookie exists, from the request headers, and a site may treat a request without a browser-like User-Agent differently; to know more: https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

In [9]:
#we are requesting the webpage (URL) using the defined headers

page = requests.get(URL, headers=headers)

#printing page shows <Response [200]> if the request succeeded, or an error status such as <Response [404]> if it failed; use page.content to view the raw bytes
print(page)

#BeautifulSoup parses the raw HTML into a soup object that you can navigate and search for the data you want.
#For example, soup.title gives the title of the web page, and soup.get_text() gives all of the text on the page.
#Here the parsed soup is converted straight back to a string, because the names are extracted from the HTML markup itself

soup = str(BeautifulSoup(page.content, 'html.parser'))

#here it is easier to identify patterns in the HTML markup, so instead of extracting the text first we use the markup to our advantage
#re.findall() returns every piece of text matched by the capture group (.*?) of a given pattern
#here the required text lies between <a href="/babyname/  and  </a>

patt = "<a href=\"/babyname/(.*?)</a>"
reout = re.findall(patt,soup)


<Response [200]>
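
To see what `re.findall()` returns for this kind of pattern, here is a small sketch that runs offline. The HTML fragment is made up for illustration (the real page markup may differ), but the pattern is the same one used above.

```python
import re

# Hypothetical HTML fragment, invented for this example; the real page may look different.
sample_html = (
    '<li><a href="/babyname/1001/aarav">Aarav</a></li>'
    '<li><a href="/babyname/1002/kabir">Kabir</a></li>'
)

# (.*?) is a non-greedy capture group: it stops at the first </a> that follows.
pattern = r'<a href="/babyname/(.*?)</a>'
matches = re.findall(pattern, sample_html)
print(matches)  # ['1001/aarav">Aarav', '1002/kabir">Kabir']

# Each match still carries the tail of the opening tag, so keep only what follows '>'.
names = [m[m.find('>') + 1:].strip().lower() for m in matches]
print(names)  # ['aarav', 'kabir']
```

Because the capture group starts right after `/babyname/`, every match includes the rest of the link before the name. That is why the names are sliced from the `'>'` character onwards later in the notebook.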


In [4]:
#defining a function findnum(ch) that accepts one mandatory argument

def findnum(ch):
    
    #dictionary mapping each number (key) to the list of letters assigned to it (value)
    
    dict1 = {1:['a', 'j', 's'],2:['b', 'k', 't'],3:['c', 'l', 'u'],4:['d', 'm', 'v'],5:['e', 'n', 'w'],6:['f', 'o', 'x'],7:['g', 'p', 'y'],8:['h', 'q', 'z'],9:['i','r']}

    #dict.keys() and dict.values() give the keys and the values of dict1 in matching order; list() turns each of them into a list
    
    key_list = list(dict1.keys()) 
    val_list = list(dict1.values())

    #finding which key the character ch is assigned to
    
    for i in range(len(val_list)):
        if ch in val_list[i]:
            #val_list[i] belongs to key_list[i], since both lists are in the same order
            return key_list[i]

    #assumption: characters that are not in dict1 (spaces, hyphens, ...) should not change the sum, so they count as 0
    return 0
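
As a possible alternative to searching the lists on every call, the dictionary can be inverted once so that each letter points directly to its number. This is only a sketch: `letter_to_number` and `find_num_fast` are hypothetical names that are not used elsewhere in this notebook, and treating unknown characters as 0 is an assumption.

```python
# Same letter groups as dict1 in findnum().
number_to_letters = {
    1: ['a', 'j', 's'], 2: ['b', 'k', 't'], 3: ['c', 'l', 'u'],
    4: ['d', 'm', 'v'], 5: ['e', 'n', 'w'], 6: ['f', 'o', 'x'],
    7: ['g', 'p', 'y'], 8: ['h', 'q', 'z'], 9: ['i', 'r'],
}

# Invert the mapping once: one entry per letter, e.g. {'a': 1, 'j': 1, 's': 1, 'b': 2, ...}
letter_to_number = {
    letter: number
    for number, letters in number_to_letters.items()
    for letter in letters
}

def find_num_fast(ch):
    # dict.get() returns the second argument when the key is missing;
    # assumption: characters outside a-z count as 0.
    return letter_to_number.get(ch, 0)

print(find_num_fast('r'))  # 9
print(find_num_fast('-'))  # 0
```

The dictionary comprehension uses two `for` clauses, which read in the same order as two nested for loops would.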


In [5]:
#function namer(reout, num) with two mandatory arguments 

def namer(reout, num):
    
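    #split_list collects every name as a list of its letters; it must start empty on every call
    split_list = []
    
    #flag is used to detect where the second wave of baby names (girl names) starts
    flag = 0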
    
    #iterating over the list reout, with i as the loop variable; since reout is a list of strings, every i value will be a string
    
    for i in reout:
        
        #slice i from just after the '>' character to the end and assign the result to temp; i.find('>') gives the index of '>'
        #string slicing and list slicing work the same way; list[3:5] returns the elements at index 3 and 4; list[3:] returns all elements from index 3 to the end of the list
        
        temp = i[i.find('>')+1:]
        
        #temp.lower() converts the name to lower case; remember that dict1 only contains lower case letters
        
        temp = temp.lower()
        
        #flag is set to 1 as soon as a name does not start with 'a'
        
        if( temp[0] != 'a' ):
            flag = 1
        
        #once flag is 1, we watch for a name starting with 'a' again, which marks the start of the second list; we break the loop when that happens
        
        if( (flag == 1) and (temp[0] == 'a') ):
            break
        
        #as long as the loop has not been broken, the name is appended to the list 'split_list'
        #temp.strip() trims leading and trailing whitespace; list() splits the string into individual characters, e.g. ['r','a','j']
        
        split_list.append(list(temp.strip()))
        
    ###################
    #in this part of the code, we find the digit summation value of each baby name
    
    #ret_list stores all baby names whose digit summation value is equal to the number 'num'
    
    ret_list = []
    
    for i in split_list:
        tempval = 0
        for j in i:
             
            tempval = tempval+ findnum(j)
            #print(str(j) + " " + str(findnum(j)) + " " + str(tempval))
        
        tempval2 = tempval
        
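        #reduce the total to a single digit by summing its digits repeatedly, e.g. 29 -> 2+9 = 11 -> 1+1 = 2
        while(tempval2>9):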
            sum_of_digits = 0
            for digits in str(tempval2):
                sum_of_digits = sum_of_digits + int(digits) 
            tempval2 = sum_of_digits
        
        if(tempval2 == num):

            # "".join() method joins the elements in a list and returns as one string
            ret_list.append("".join(i))
    
    return ret_list
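
To check the digit summation by hand, here is a compact sketch of the same calculation for a single name. The names `letter_value`, `reduce_to_digit` and `name_number` are hypothetical helpers, not part of the notebook. Note that `letter_value` uses an arithmetic shortcut instead of the `dict1` lookup; it gives the same values because the letters simply cycle through 1 to 9 in alphabetical order.

```python
def letter_value(ch):
    # a=1, b=2, ..., i=9, then j=1 again, and so on; same values as dict1.
    if 'a' <= ch <= 'z':
        return (ord(ch) - ord('a')) % 9 + 1
    # Assumption: anything that is not a lower case letter counts as 0.
    return 0

def reduce_to_digit(total):
    # Sum the digits repeatedly until a single digit remains.
    while total > 9:
        total = sum(int(d) for d in str(total))
    return total

def name_number(name):
    total = sum(letter_value(ch) for ch in name.lower())
    return reduce_to_digit(total)

print(name_number('rohan'))   # 9+6+8+1+5 = 29 -> 2+9 = 11 -> 1+1 = 2
print(name_number('ivan'))    # 9+4+1+5 = 19 -> 1+9 = 10 -> 1+0 = 1
print(name_number('tushar'))  # 2+3+1+8+1+9 = 24 -> 2+4 = 6
```

These three results agree with the output lists further down, where 'rohan' appears under 2, 'ivan' under 1 and 'tushar' under 6.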
    

In [6]:
#print all baby names according to the number that they satisfy
for i in range(1,10):
    print(str(i),end = ": ")
    print(namer(reout,i))

#write to a file 'out.txt' the names that satisfy the given number; here it is 6
with open("out.txt","w") as f:
    f.write(",".join(namer(reout,6)))


1: ['chirag', 'devansh', 'dhruv', 'divit', 'himmat', 'hridaan', 'ivan', 'jivin', 'taimur', 'tejas', 'vihaan']
2: ['advik', 'arnav', 'badal', 'bhavin', 'krish', 'manikya', 'miraan', 'priyansh', 'rohan', 'shayak', 'shlok', 'umang', 'vaibhav', 'vidur']
3: ['aayush', 'armaan', 'divyansh', 'ehsaan', 'farhan', 'gatik', 'gokul', 'pranay', 'raghav', 'raunak', 'yakshit', 'yuvaan']
4: ['akarsh', 'fateh', 'indrajit', 'kanav', 'madhav', 'purab', 'romil', 'ryan', 'sahil', 'sumer']
5: ['aarush', 'anay', 'azad', 'dhanuk', 'faiyaz', 'hansh', 'hiran', 'jayesh', 'kabir', 'lakshay', 'mehul', 'nakul', 'onkar', 'veer', 'zain']
6: ['nishith', 'prerak', 'tushar', 'vivaan', 'zeeshan']
7: ['aarav', 'aniruddh', 'arhaan', 'darshit', 'ishaan', 'kartik', 'nirvaan', 'riaan', 'samar', 'shaan', 'shamik', 'stuvan', 'uthkarsh', 'yuvraj']
8: ['abram', 'hunar', 'jayant', 'lagan', 'lakshit', 'ranbir', 'ritvik', 'samarth', 'shalv', 'shray']
9: ['divij', 'emir', 'indranil', 'kiaan', 'madhup', 'ojas', 'reyansh', 'saksham', '
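
The file written above is one line of comma-separated names. A minimal sketch of reading such a file back into a list could look like this. It writes to a temporary directory, which is an assumption made here so that the example does not overwrite a real out.txt.

```python
import os
import tempfile

# The names found for number 6 in the output above.
names = ['nishith', 'prerak', 'tushar', 'vivaan', 'zeeshan']

# A temporary directory is removed automatically when the with block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'out.txt')

    # ",".join() turns the list into one comma-separated string.
    with open(path, 'w') as f:
        f.write(','.join(names))

    # str.split(',') is the reverse step: one string back to a list.
    with open(path) as f:
        restored = f.read().split(',')

print(restored == names)  # True
```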

Of course, numerology isn't science, but it gave us a good (fun) real-world use case to try our hands on lists, dictionaries, and list and string methods like append and join.
We used for loops and while loops, and saw which scenarios each of them suits.
We had a glimpse of how functions are used and how to write to files.
We saw how temp variables and flags are set, and we also had an overview of how the requests, bs4 and re libraries work and what they are capable of!

Sounder Rajendran | +91-9080910468 | emailtorsounder@gmail.com | Reach out for any queries

   