 # $\color{green}{\text{ Text Classification; Web Scraping  }}$


# Goal: predict the artist from a piece of text.

## Scrape the desired website

Web scraping = getting data from websites

In [418]:
import requests
import numpy as np
import random
import os
import pandas as pd

The `requests` library lets us send HTTP requests to websites and servers. Each request returns a response object that carries a status code and, if the request succeeded, the full HTML of the page.

In [368]:
URL_1 = 'https://www.lyrics.com/artist/Nicki-Minaj/1049866'
URL_2 = 'https://www.lyrics.com/artist/Pink-Floyd/76669'


Most web servers try to detect and block web scraping attempts.
To stay undetected, you can try the following:
- Set a real user agent and other headers so the request looks like it comes from a regular browser (a list of user agents can be found [here](https://github.com/tamimibrahim17/List-of-user-agents)):

In [369]:
headers = {'User-agent': 'Mozilla/5.0 (X11; Linux i686; rv:2.0b10) Gecko/20100101 Firefox/4.0b10'}


- Use a sleep or waiting time between requests, since a computer can send requests much faster than a human:

In [366]:
import time
import numpy as np

In [370]:
#time.sleep(5)
response_1 = requests.get(url=URL_1, headers=headers)


In [371]:

response_2 = requests.get(url=URL_2, headers=headers)
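
The waiting-time advice can be sketched as a small helper. This is only an illustration: the name `fetch_politely`, its parameters, and the delay range are assumptions, not part of the original notebook. The request function is passed in as `get`, so the helper itself does not depend on any HTTP library.

```python
import random
import time


def fetch_politely(urls, get, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL with `get`, pausing a random interval between requests.

    `get` is any callable that takes a URL and returns a response
    (hypothetical parameter, e.g. a thin wrapper around requests.get).
    """
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        responses.append(get(url))
    return responses
```

With the variables defined above, a call could look like `fetch_politely([URL_1, URL_2], lambda u: requests.get(url=u, headers=headers))`. A random delay is used instead of a fixed one so that the request rhythm looks less mechanical.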

In [32]:
response_1

<Response [200]>

In [372]:
response_1.status_code


200



* 200-range: successful
* 300-range: redirect
* 400-range: there was a problem with the client's request
* 500-range: there was a problem on the end of the server
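
The status-code ranges listed above can be turned into a small lookup. This is a minimal sketch; `describe_status` is a hypothetical helper name, and the labels are shortened versions of the list entries.

```python
def describe_status(code):
    """Map an HTTP status code to the broad class it belongs to."""
    classes = {
        2: "successful",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 gives the leading digit of a three-digit code.
    return classes.get(code // 100, "other")


print(describe_status(200))
print(describe_status(404))
```

In practice `requests` also offers `response.ok` and `response.raise_for_status()` for the common case of simply checking whether a request succeeded.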

The webpage itself is stored in the `.text` attribute of the response.

In [387]:
Nicky_html = response_1.text
Pink_html = response_2.text

What do you notice?
* It's written in HTML (you don't need to know HTML to scrape data, though!)
* HTML is structured using _tags_ (it also has _classes_ and _ids_)
* It is nested and has a hierarchical structure

In [374]:
print(Nicky_html)

<!doctype html>
<html lang="en-US">
<head>
<meta name="theme-color" content="#830C66"/>

<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Nicki Minaj Lyrics, Songs and Albums | Lyrics.com</title>
<meta name="description" content="Nicki Minaj Lyrics - All the great songs and their lyrics from Nicki Minaj on Lyrics.com">
<meta name="keywords" content="Nicki Minaj lyrics, Nicki Minaj song lyrics, Nicki Minaj lyric">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<base href="https://www.lyrics.com/">
<script>
s4Prefix = 'https://static.stands4.com';
version = '1.4.11';
</script>

<link rel="preload" href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css" as="style" />
<link rel="stylesheet" href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css">

<link rel="preconnect" href="https://fonts.googleapis.com" crossorigin />
<link rel="dns-prefetch" href="https://fonts.googleapis

- What is this output?

It's written in HTML (you don't need to know HTML though to scrape data!)

[HTML Introduction](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)

- Each element consists of tags (opening tag and closing tag) and content.

- Some tags will have attributes which are like labels that are not displayed with the content but can help distinguish between different types of the same tag. 
- Important for this week: `a` tags, which define links. The target URL is stored in the `href` attribute: `<a href="URL">link text</a>`
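
To make tags and attributes concrete, here is a minimal sketch that collects the `href` attribute of every `a` tag in a short HTML string. It uses only the standard library's `html.parser` (the notebook itself uses BeautifulSoup further down); the class name `LinkCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs for the opening tag.
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical snippet: two links nested inside a paragraph.
sample_html = (
    '<p>Songs: <a href="/lyric/1/Artist/Song+One">Song One</a> and '
    '<a class="x" href="/lyric/2/Artist/Song+Two">Song Two</a></p>'
)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

The second link carries an extra `class` attribute, which shows that attributes label a tag without being part of the displayed text.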



### Writing/reading files in Python

- This week we will work a lot with text files and strings. We will scrape a large amount of data, so it is important to do this only once and save the result: re-running the scraping script takes a long time and can get you banned.

Let's see how to write to and read from a file in Python.

- [Python documentation](https://docs.python.org/3/tutorial/inputoutput.html?highlight=open%20library#reading-and-writing-files)

Modes:

* "w" - write: creates the file, or overwrites its content if it already exists
* "r" - read content from a file
* "a" - append content to the end of a file


In [11]:
# w: write creates a file if it doesn't exist or overwrites it if it does:

In [375]:
f = open('Nicky_html.txt', 'w')
f.write(Nicky_html)
f.close()  # the parentheses are required; a bare f.close does not close the file
f1 = open('Pink_html.txt', 'w')
f1.write(Pink_html)
f1.close()

In [376]:
# r: read only, you can read in a file and save it as a variable in your code or print it
f = open('Nicky_html.txt', 'r')
Nicky_html_read=f.read()
#print(Nicky_html_read)
f.close()
f1 = open("Pink_html.txt", 'r')
Pink_html = f1.read()
f1.close()

In [None]:
# a: append mode adds text to the end of a file without overwriting it.

In [377]:
f = open('Nicky_html.txt', 'a')
f.write('blah blah blah')
f.close()  # the parentheses are required; a bare f.close does not close the file

In [43]:
f = open('Nicky_html.txt', 'r')
Nicky_html_read=f.read()
print(Nicky_html_read)
f.close()

<!doctype html>
<html lang="en-US">
<head>
<meta name="theme-color" content="#830C66"/>

<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Nicki Minaj Lyrics, Songs and Albums | Lyrics.com</title>
<meta name="description" content="Nicki Minaj Lyrics - All the great songs and their lyrics from Nicki Minaj on Lyrics.com">
<meta name="keywords" content="Nicki Minaj lyrics, Nicki Minaj song lyrics, Nicki Minaj lyric">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<base href="https://www.lyrics.com/">
<script>
s4Prefix = 'https://static.stands4.com';
version = '1.4.10';
</script>

<link rel="preload" href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css" as="style" />
<link rel="stylesheet" href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css">

<link rel="preconnect" href="https://fonts.googleapis.com" crossorigin />
<link rel="dns-prefetch" href="https://fonts.googleapis

In [378]:
f = open('Pink_html.txt', 'r')
Pink_html_read=f.read()
print(Pink_html_read)
f.close()

<!doctype html>
<html lang="en-US">
<head>
<meta name="theme-color" content="#830C66"/>

<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Pink Floyd Lyrics, Songs and Albums | Lyrics.com</title>
<meta name="description" content="Pink Floyd Lyrics - All the great songs and their lyrics from Pink Floyd on Lyrics.com">
<meta name="keywords" content="Pink Floyd lyrics, Pink Floyd song lyrics, Pink Floyd lyric">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<base href="https://www.lyrics.com/">
<script>
s4Prefix = 'https://static.stands4.com';
version = '1.4.11';
</script>

<link rel="preload" href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css" as="style" />
<link rel="stylesheet" href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css">

<link rel="preconnect" href="https://fonts.googleapis.com" crossorigin />
<link rel="dns-prefetch" href="https://fonts.googleapis.com">
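
Since scraping should happen only once, reading and writing can be combined into a small cache: use the saved file if it exists, otherwise fetch and save. This is a sketch under assumptions; `load_or_fetch` and `fake_fetch` are hypothetical names, and the demo uses a stand-in function instead of a real HTTP request. It also uses a `with` block, which closes the file automatically.

```python
import tempfile
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the cached text at `path`, calling `fetch()` only if it is missing."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()
    # The with-statement closes the file even if write() raises.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Demo with a stand-in fetch function instead of a real request.
fetch_calls = []


def fake_fetch():
    fetch_calls.append(1)
    return "<html>example</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "page.txt"
    first = load_or_fetch(cache_file, fake_fetch)   # fetches and saves
    second = load_or_fetch(cache_file, fake_fetch)  # reads the saved copy

print(len(fetch_calls))
```

In the notebook, `fetch` could be something like `lambda: requests.get(url=URL_1, headers=headers).text`.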

## Extract links

- re (regular expression)
- BeautifulSoup

### Inspecting a webpage

(You will need this to find links to individual songs in your artist's page.)

#### With re

In [45]:
import re

In [118]:
pattern=r'<a href="(/lyric/\d+/[A-Za-z+/%0-9]+)">'



In [119]:
pattern

'<a href="(/lyric/\\d+/[A-Za-z+/%0-9]+)">'

In [154]:
re.findall(pattern,Nicky_html_read)


['/lyric/36119616/Nicki+Minaj/All+Eyes+on+You',
 '/lyric/36635563/Nicki+Minaj/Bad+to+You',
 '/lyric/36084951/Nicki+Minaj/Goodbye',
 '/lyric/36574801/Nicki+Minaj/Fendi',
 '/lyric/36341190/Nicki+Minaj/Wobble+Up',
 '/lyric/36594597/Nicki+Minaj/Wobble+Up',
 '/lyric/36593451/Nicki+Minaj/iPHONE',
 '/lyric/36074923/Nicki+Minaj/Goodbye',
 '/lyric/36434349/Nicki+Minaj/Extravagant',
 '/lyric/36327210/Nicki+Minaj/Megatron',
 '/lyric/36461739/Nicki+Minaj/Slide+Around',
 '/lyric/36461734/Nicki+Minaj/Zanies+and+Fools',
 '/lyric/36646651/Nicki+Minaj/Tusa',
 '/lyric/36181607/Nicki+Minaj/Wobble+Up',
 '/lyric/35315169/Nicki+Minaj/Bang+Bang',
 '/lyric/35315124/Nicki+Minaj/Anaconda',
 '/lyric/35348208/Nicki+Minaj/Goodbye',
 '/lyric/35279467/Nicki+Minaj/Anybody',
 '/lyric/34991891/Nicki+Minaj/Barbie+Tingz',
 '/lyric/35143356/Nicki+Minaj/Bed',
 '/lyric/35005719/Nicki+Minaj/Ball+for+Me',
 '/lyric/35230492/Nicki+Minaj/Big+Bank',
 '/lyric/35230392/Nicki+Minaj/Boo%27d+Up',
 '/lyric/35260382/Nicki+Minaj/Boo%27d+
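
The matched paths contain duplicates and URL-encoded titles (`+` for spaces, `%27` for an apostrophe). As a minimal sketch of one possible follow-up step, the snippet below extracts links from a short sample string, removes duplicates while keeping their order, and decodes the titles. The sample HTML is constructed for illustration from two paths in the output above; the variable names are assumptions.

```python
import re
from urllib.parse import unquote_plus

# Explicit ranges: [A-z] would also match the punctuation between "Z" and "a".
LYRIC_LINK = re.compile(r'<a href="(/lyric/\d+/[A-Za-z0-9+/%]+)">')

sample_html = (
    '<a href="/lyric/35230392/Nicki+Minaj/Boo%27d+Up">Boo\'d Up</a>'
    '<a href="/lyric/35143356/Nicki+Minaj/Bed">Bed</a>'
    '<a href="/lyric/35143356/Nicki+Minaj/Bed">Bed</a>'
)

links = LYRIC_LINK.findall(sample_html)
unique_links = list(dict.fromkeys(links))  # drop duplicates, keep order
# The title is the last path segment; unquote_plus decodes "+" and "%27".
titles = [unquote_plus(link.rsplit("/", 1)[-1]) for link in unique_links]
print(titles)
```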

The remaining steps would continue from here, but the main focus of this notebook is to do the same with BeautifulSoup.

#### With BeautifulSoup

In [379]:
from bs4 import BeautifulSoup

In [380]:
mkdir Minaj

In [388]:
mkdir Pink

In [389]:
nicky_soup = BeautifulSoup(Nicky_html, 'html.parser')

In [390]:
Pink_soup = BeautifulSoup(Pink_html,'html.parser')

Find the links and titles of the songs and store them in lists:

In [396]:
links_nicky = []
song_tit_nicky = []
for a in nicky_soup.body.find_all('a', href=True):
    # keep only the anchors whose target contains 'lyric'
    if 'lyric' in a['href']:
        links_nicky.append(a['href'])
        song_tit_nicky.append(a.text)

In [397]:
links_Pink = []
song_tit_Pink = []
for a in Pink_soup.body.find_all('a', href=True):
    # keep only the anchors whose target contains 'lyric'
    if 'lyric' in a['href']:
        links_Pink.append(a['href'])
        song_tit_Pink.append(a.text)

In [398]:
links_nicky.remove(links_nicky[0])
song_tit_nicky.remove(song_tit_nicky[0])

In [399]:
links_Pink.remove(links_Pink[0])
song_tit_Pink.remove(song_tit_Pink[0])

In [358]:
links_Pink

['/lyric/36658277/Pink+Floyd/Lost+for+Words+%5BTour+Rehearsal+1994%5D',
 '/lyric/35727189/Pink+Floyd/Astronomy+Domine',
 '/lyric/35727186/Pink+Floyd/Set+the+Controls+for+the+Heart+of+the+Sun',
 '/lyric/35727184/Pink+Floyd/Flaming',
 '/lyric/35727177/Pink+Floyd/Jugband+Blues',
 '/lyric/34797174/Pink+Floyd/Ain%27t+That+Peculiar',
 '/lyric/33772133/Pink+Floyd/Crumbling+Land+%5BTake+1%5D+%5BZabriskie+Point%5D+%5B%23%5D%5BTake%5D',
 '/lyric/33773360/Pink+Floyd/Set+the+Controls+for+the+Heart+of+the+Sun+%5BAbbaye+de+Royaumont%5D',
 '/lyric/33773359/Pink+Floyd/Cymbaline+%5BAbbaye+de+Royaumont%5D',
 '/lyric/33773358/Pink+Floyd/Atom+Heart+Mother+%5BMusikforum+-+Austria%2C+01+July+1971%5D',
 '/lyric/33773353/Pink+Floyd/One+of+These+Days+%5BFrench+Windows%5D',
 '/lyric/33773352/Pink+Floyd/Atom+Heart+Mother+%5BMusikforum%2C+Colour%5D',
 '/lyric/33773351/Pink+Floyd/Atom+Heart+Mother+%5BHakone%5D',
 '/lyric/33773441/Pink+Floyd/Point+Me+at+the+Sky',
 '/lyric/33773440/Pink+Floyd/It+Would+Be+So+Nice',
 

In [296]:
song_tit_nicky

['All+Eyes+on+You',
 'Bad+to+You',
 'Goodbye',
 'Fendi',
 'Wobble+Up',
 'Wobble+Up',
 'iPHONE',
 'Goodbye',
 'Extravagant',
 'Megatron',
 'Slide+Around',
 'Zanies+and+Fools',
 'Tusa',
 'Wobble+Up',
 'Bang+Bang',
 'Anaconda',
 'Goodbye',
 'Anybody',
 'Barbie+Tingz',
 'Bed',
 'Ball+for+Me',
 'Big+Bank',
 'Boo%27d+Up',
 'Boo%27d+Up',
 'Chun-li',
 'Runnin',
 'Runnin',
 'MotorSport',
 'Poke+It+Out',
 'Fefe',
 'Mama',
 'Run+Up',
 'FEFE',
 'Goodbye',
 'Anybody',
 'Swalla',
 'No+Candle+No+Light',
 'Woman+Like+Me',
 'Woman+Like+Me',
 'Starships',
 'Side+to+Side',
 'Goodbye',
 'Super+Bass',
 'Ganja+Burns',
 'Ganja+Burn',
 'Majesty',
 'Barbie+Dreams',
 'Rich+Sex',
 'Hard+White',
 'Bed',
 'Thought+I+Knew+You',
 'Run+%26+Hide',
 'Chun+Swae',
 'Chun-Li',
 'LLC',
 'Good+Form',
 'Nip+Tuck',
 '2+Lit+2+Late+Interlude',
 'Come+See+About+Me',
 'Sir',
 'Miami',
 'Coco+Chanel',
 'Inspirations+Outro',
 'Fefe',
 'Ganja+Burns',
 'Majesty',
 'Barbie+Dreams',
 'Rich+Sex',
 'Hard+White',
 'Bed',
 'Thought+I+Knew+

Randomly choose 100 songs, find their lyrics and save them

In [400]:
i = 0
while i < 100:
    time.sleep(1)  # pause between requests so the server is not flooded
    # Draw a random song (index 0 is never drawn). Sampling is with replacement,
    # so a repeated draw simply overwrites the file that was saved earlier.
    j = random.choice(range(1, len(song_tit_nicky)))
    URL = "https://www.lyrics.com/" + links_nicky[j]
    lyrics_nicky = requests.get(url=URL, headers=headers)
    lyric_nicky_soup = BeautifulSoup(lyrics_nicky.text, 'html.parser')
    print(URL)
    lyric_body = lyric_nicky_soup.find(id="lyric-body-text")
    if lyric_body is None:
        continue  # no lyrics on this page, draw another song
    lyrics_pure = lyric_body.text
    with open("./Minaj/" + song_tit_nicky[j], 'w') as f:
        f.write(lyrics_pure)
    i += 1
    
    

https://www.lyrics.com//lyric/28691119/Nicki+Minaj/Pound+the+Alarm
https://www.lyrics.com//lyric/26240535/Nicki+Minaj/Turn+Me+On
https://www.lyrics.com//lyric/36429087/Nicki+Minaj/Baps
https://www.lyrics.com//lyric/33873935/Nicki+Minaj/Light+My+Body+Up
https://www.lyrics.com//lyric/19850326/Nicki+Minaj/Kill+Da+DJ
https://www.lyrics.com//lyric/35372977/Nicki+Minaj/2+Lit+2+Late+Interlude
https://www.lyrics.com//lyric/36061083/Nicki+Minaj/Senile
https://www.lyrics.com//lyric/26329758/Nicki+Minaj/Pound+the+Alarm
https://www.lyrics.com//lyric/36070750/Nicki+Minaj/Muny
https://www.lyrics.com//lyric/31687001/Nicki+Minaj/Y%27a+Rien+%C3%A0+Faire
https://www.lyrics.com//lyric/35185306/Nicki+Minaj/MotorSport
https://www.lyrics.com//lyric-lf/10362200/Nicki+Minaj/Ucci+Ucci
https://www.lyrics.com//lyric/33971601/Nicki+Minaj/Entertainment
https://www.lyrics.com//lyric/22197336/Nicki+Minaj/Bottoms+Up+%5BVideo%5D+%5B%2A%5D%5BMultimedia+Track%5D
https://www.lyrics.com//lyric-lf/3362699/Nicki+Minaj/Dopem

In [401]:
i = 0
while i < 100:
    time.sleep(1)  # pause between requests so the server is not flooded
    # Draw a random song (index 0 is never drawn). Sampling is with replacement,
    # so a repeated draw simply overwrites the file that was saved earlier.
    j = random.choice(range(1, len(song_tit_Pink)))
    URL = "https://www.lyrics.com/" + links_Pink[j]
    lyrics_Pink = requests.get(url=URL, headers=headers)
    lyric_Pink_soup = BeautifulSoup(lyrics_Pink.text, 'html.parser')
    print(URL)
    lyric_body = lyric_Pink_soup.find(id="lyric-body-text")
    if lyric_body is None:
        continue  # no lyrics on this page, draw another song
    lyrics_pure = lyric_body.text
    with open("./Pink/" + song_tit_Pink[j], 'w') as f:
        f.write(lyrics_pure)
    i += 1

https://www.lyrics.com//lyric/32973969/Pink+Floyd/Corporal+Clegg
https://www.lyrics.com//lyric/3447811/Pink+Floyd/See+Emily+Play
https://www.lyrics.com//lyric/395903/Pink+Floyd/Sheep
https://www.lyrics.com//lyric/33335843/Pink+Floyd/Run+Like+Hell
https://www.lyrics.com//lyric/3447800/Pink+Floyd/Learning+to+Fly
https://www.lyrics.com//lyric/6329938/Pink+Floyd/Interstellar+Overdrive
https://www.lyrics.com//lyric/24585379/Pink+Floyd/Eclipse
https://www.lyrics.com//lyric/24440822/Pink+Floyd/Shine+on+You+Crazy+Diamond%2C+Pts.+1-5
https://www.lyrics.com//lyric/5030224/Pink+Floyd/Scarecrow
https://www.lyrics.com//lyric/33899894/Pink+Floyd/Introduction+%5BLive+in+Stockholm+1967%5D
https://www.lyrics.com//lyric/12384488/Pink+Floyd/Proper+Education
https://www.lyrics.com//lyric/33899748/Pink+Floyd/Point+Me+at+the+Sky+%5BRestored+Promo+Video%5D
https://www.lyrics.com//lyric/30606309/Pink+Floyd/Another+Brick+in+the+Wall%2C+Pt.+2+%5BThe+Wall+Work+in+Progress+Pt.+1%2C+1979
https://www.lyrics.com//ly

https://www.lyrics.com//lyric/30604336/Pink+Floyd/Childhood%27s+End
https://www.lyrics.com//lyric/12228785/Pink+Floyd/Chapter+24+%5BStereo%5D
https://www.lyrics.com//lyric/12201145/Pink+Floyd/Take+Up+Thy+Stethoscope+and+Walk
https://www.lyrics.com//lyric/13463665/Pink+Floyd/Let+There+Be+More+Light
https://www.lyrics.com//lyric/33773353/Pink+Floyd/One+of+These+Days+%5BFrench+Windows%5D
https://www.lyrics.com//lyric/12228767/Pink+Floyd/Lucifer+Sam
https://www.lyrics.com//lyric/13463763/Pink+Floyd/Run+Like+Hell
https://www.lyrics.com//lyric/22706473/Pink+Floyd/Proper+Education
https://www.lyrics.com//lyric/1831105/Pink+Floyd/Dogs
https://www.lyrics.com//lyric/19567987/Pink+Floyd/Money
https://www.lyrics.com//lyric/33899906/Pink+Floyd/Butterfly
https://www.lyrics.com//lyric/25130096/Pink+Floyd/Don%27t+Leave+Me+Now
https://www.lyrics.com//lyric/14269954/Pink+Floyd/Point+Me+at+the+Sky
https://www.lyrics.com//lyric/11939006/Pink+Floyd/Proper+Education+%5BOriginal+Version%5D
https://www.lyrics
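
The loops above draw songs with replacement and build URLs by string concatenation, which is why the printed URLs contain `//` after the host. A possible alternative is sketched below under assumptions: `pick_songs` and `safe_filename` are hypothetical helpers, and the three sample links are taken from the output above. `random.sample` draws without replacement, `urljoin` avoids the doubled slash, and the file name is cleaned of characters such as `/` that would break `open()`.

```python
import random
import re
from urllib.parse import urljoin

BASE_URL = "https://www.lyrics.com/"


def pick_songs(links, titles, k, seed=None):
    """Pick up to k distinct (link, title) pairs."""
    rng = random.Random(seed)
    pairs = list(zip(links, titles))
    return rng.sample(pairs, min(k, len(pairs)))


def safe_filename(title):
    """Replace characters that are awkward in file names with underscores."""
    return re.sub(r"[^\w\-. ]", "_", title).strip() + ".txt"


links = [
    "/lyric/395903/Pink+Floyd/Sheep",
    "/lyric/1831105/Pink+Floyd/Dogs",
    "/lyric/19567987/Pink+Floyd/Money",
]
titles = ["Sheep", "Dogs", "Money"]

for link, title in pick_songs(links, titles, k=2, seed=0):
    print(urljoin(BASE_URL, link), "->", safe_filename(title))
```

Pages without a `lyric-body-text` element would still have to be skipped, so a fixed sample of `k` songs can yield fewer than `k` saved files; drawing a slightly larger sample is one way to compensate.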

## Data preprocessing for modelling

#### Build a lyrics corpus from the lyrics saved in the Pink and Minaj directories

In [430]:
corpus=[]
Name_of_artists=[]
for filename in os.listdir("./Minaj/"):
    with open("./Minaj/" + filename, "r") as f:
        corpus.append(f.read())
        Name_of_artists.append("Minaj")

["It was back in '07 did a couple of tapes\nDid a couple DVD's made a couple mistakes\nDidn't know what I was doing, but I put on a cape\nNow it's which World Tour should I go on and take\nSee you told me I would lose but I won\nI might cop a million Jimmy Choo's just for fun\n'Cause bitches couldn't take what was in me\nAustralia, Sydney\nMight run up in Disney, out in L.A. with Lindsey\nGot the eye of the tiger, the lion of Judah\nNow it's me in my time, it's just me in my prime\nEverything I tried to teach 'em, they gone see it in time\nTell them bitches get a stick, I'm done leadin' the blind\nGot two shows tonight, that's Brooklyn and Dallas\nThen a private party at the Buckingham palace\nWhich means I gotta fly like a movie no commercial\nThat's Young Money, Cash Money yeah I'm Universal\n\nI hear they comin' for me\nBecause the top is lonely\nWhat the f*ck they gon' say\nWhat the f*ck they gon' say\nI'm the best bitch doin' it, doin' it\nI'm the best bitch doin' it, doin' it\nI'

In [431]:
for filename in os.listdir("./Pink/"):
    with open("./Pink/" + filename, "r") as f1:
        corpus.append(f1.read())
        Name_of_artists.append("Pink")

["Wasting my time, \nResting my mind \nAnd I'll never pine \nFor the sad days and the bad days \nWhen we was workin' from nine to five. \nAnd if you don't mind \nI'll spend my time \n\nHere by the fire side \nIn the warm light and the love in her eyes. \nAnd if you don't mind \nI'll spend my time \nHere by the fire side \nIn the warm light of her eyes",
 "You gotta be crazy, you gotta have a real need\nYou gotta sleep on your toes, and when you're on the street\nYou gotta be able to pick out the easy meat with your eyes closed\nAnd then moving in silently, down wind and out of sight\nYou gotta strike when the moment is right without thinking\n\nAnd after a while, you can work on points for style\nLike the club tie, and the firm handshake\nA certain look in the eye and an easy smile\nYou have to be trusted by the people that you lie to\nSo that when they turn their backs on you,\nYou'll get the chance to put the knife in\n\nYou gotta keep one eye looking over your shoulder\nYou know it'

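The two loops above differ only in the directory and the label, so they can be folded into one function. The following is a sketch, not the notebook's code: `load_corpus` is a hypothetical helper, the demo writes placeholder files into a temporary directory, and it assumes the lyrics were saved as UTF-8.

```python
import tempfile
from pathlib import Path


def load_corpus(artist_dirs):
    """Read every file in each directory; return (texts, labels) in matching order."""
    texts, labels = [], []
    for label, directory in artist_dirs.items():
        # sorted() makes the order reproducible across runs and platforms.
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                texts.append(path.read_text(encoding="utf-8"))
                labels.append(label)
    return texts, labels


# Demo with placeholder files (not real lyrics).
demo_songs = {
    "Minaj": ["first placeholder text"],
    "Pink": ["second placeholder text", "third placeholder text"],
}
with tempfile.TemporaryDirectory() as tmp:
    for artist, songs in demo_songs.items():
        folder = Path(tmp) / artist
        folder.mkdir()
        for n, text in enumerate(songs):
            (folder / f"song_{n}.txt").write_text(text, encoding="utf-8")
    demo_corpus, demo_labels = load_corpus(
        {"Minaj": Path(tmp) / "Minaj", "Pink": Path(tmp) / "Pink"}
    )

print(demo_labels)
```

Keeping texts and labels in one loop guarantees that the two lists stay aligned, which the model fitting later relies on.
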
#### Vectorizer

In [432]:
from sklearn.feature_extraction.text import CountVectorizer

In [433]:
vectorizer = CountVectorizer(stop_words="english")

In [436]:
matrix = vectorizer.fit_transform(corpus)

In [437]:
matrix.todense()  # just to inspect the counts

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 1],
        ...,
        [0, 0, 0, ..., 0, 1, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
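
To see what such a count matrix contains, here is a rough pure-Python illustration of the bag-of-words idea. It is not scikit-learn's implementation: `bag_of_words` is a hypothetical function, the stop-word set is a tiny made-up one, and the token pattern (words of at least two characters, lowercased) only mirrors `CountVectorizer`'s default behaviour.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def bag_of_words(documents, stop_words=frozenset()):
    """Return (vocabulary, count matrix) for a list of strings."""
    counts = []
    for doc in documents:
        tokens = [t for t in TOKEN.findall(doc.lower()) if t not in stop_words]
        counts.append(Counter(tokens))
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


docs = ["Shine on you crazy diamond", "You gotta be crazy"]
vocabulary, rows = bag_of_words(docs, stop_words={"on", "you", "be"})
print(vocabulary)
print(rows)
```

Each row is one document and each column one word of the vocabulary, which is why the real matrix above is mostly zeros: most words occur in only a few songs.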

['Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Minaj',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink',
 'Pink'

#### Normalisation using TF-IDF

In [450]:
from sklearn.feature_extraction.text import TfidfTransformer 

In [453]:
tf= TfidfTransformer()

In [454]:
matrix_t=tf.fit_transform(matrix)

In [459]:
df=pd.DataFrame(matrix_t.todense(), columns=vectorizer.get_feature_names_out())
df

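| | 07 | 09 | 10 | 100 | 12 | 129 | 150 | 16 | 1st | 20 | ... | yuh | yung | yup | zanotti | zero | zig | zolanski | zombie | zone | zoomed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.035664 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 150 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 151 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 152 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |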
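
The TF-IDF weighting can be illustrated by hand. With its default settings, `TfidfTransformer` is documented to use the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scale every row to unit Euclidean length. The sketch below re-implements that formula in plain Python on a tiny made-up count matrix; `tfidf` is a hypothetical function for illustration, not the library code.

```python
import math


def tfidf(counts):
    """Smoothed TF-IDF with L2-normalised rows for a list of count rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[tf * w for tf, w in zip(row, idf)] for row in counts]
    result = []
    for row in weighted:
        # Guard against all-zero rows to avoid dividing by zero.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        result.append([v / norm for v in row])
    return result


counts = [[1, 1, 0, 1], [1, 0, 1, 0]]  # two documents, four terms
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

The first term occurs in both documents and therefore receives a lower weight than terms that occur in only one, which is the point of the normalisation: words shared by every song say little about the artist.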


### Fit a logistic regression model

In [460]:
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split


In [461]:
X=df.values
y=Name_of_artists
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=42)

In [463]:
m = LogisticRegression()
m.fit(X_train, y_train)

In [465]:
m.score(X_train, y_train)

1.0
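
A training accuracy of 1.0 only shows that the model fits the songs it has already seen; the held-out songs are what indicate how well it generalises. In the cells above, the vectorizer and the TF-IDF transformer are also fitted on the whole corpus before the split. The following sketch shows one way to address both points with a scikit-learn pipeline. It runs on a small made-up toy corpus (not the scraped lyrics), and the variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical word lists, not real lyrics).
texts = [
    "money cash diamonds tonight",
    "cash money party tonight",
    "diamonds party money queen",
    "queen cash tonight party",
    "moon sun echoes silence",
    "silence wall echoes time",
    "time moon wall sun",
    "sun echoes time silence",
]
labels = ["Minaj"] * 4 + ["Pink"] * 4

# Split the raw texts first, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# TfidfVectorizer combines CountVectorizer and TfidfTransformer; inside the
# pipeline it is fitted on the training texts only.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train accuracy: {train_score:.2f}, test accuracy: {test_score:.2f}")
```

Applied to the notebook's data, `texts` and `labels` would be `corpus` and `Name_of_artists`, and `model.predict(["some new line of text"])` would return the predicted artist.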