<h1 align=center><font size = 5>String Operations in Python</font></h1>

<br>

This notebook shows how to read a text file and perform common operations on strings.

## Table of Contents


<div class="alert alert-block alert-info" style="margin-top: 20px">
<li><a href="#ref0">About the Dataset</a></li>
<li><a href="#ref1">Reading Text Files</a></li>
<li><a href="#ref2">String Operations</a></li>
</div>

<hr>

<a id="ref0"></a>
<h2 align=center>About the Dataset</h2>

In this module, we use the ``Thriller.txt`` file, which contains a text summary of the **Thriller** album. We will read the file and then perform various string operations on its contents.

This is how our data looks:

<font face="courier new" size="2">Thriller is the sixth studio album by the American singer Michael Jackson, released on November 30, 1982 by Epic Records. It is the follow-up to Jackson's critically and commercially successful fifth studio album Off the Wall (1979). Thriller explores genres similar to those of Off the Wall, including pop, post-disco, rock and funk. Recording sessions for Thriller took place from April to November 1982 at Westlake Recording Studios in Los Angeles, California with a production budget of \$750,000 (US \$1,842,155 in 2016 dollars[1]), assisted by producer Quincy Jones. Of the album's nine tracks, four were written by Jackson. Seven singles were released from the album, all of which reached the top 10 on the Billboard Hot 100 chart. Three of the singles had music videos released. "Baby Be Mine" and "The Lady in My Life" were the only tracks that were not released as singles from the album.

[1] Federal Reserve Bank of Minneapolis Community Development Project. "Consumer Price Index (estimate) 1800–". Federal Reserve Bank of Minneapolis. Retrieved October 21, 2016.</font>

Source: https://en.wikipedia.org/wiki/Thriller_(Michael_Jackson_album)

<hr>

<a id="ref1"></a>
<h2 align=center>Reading Text Files</h2>

First, we use `wget -O destination source` to download the text file from the web and store it at the path `/resources/data/Thriller.txt`. The code below opens a local copy at `dataset/thriller.txt`; adjust the path to wherever you saved the file.

After storing the file on disk, we use `open(path, mode)` to return a file object. The mode specifies how the file is opened, for example `"r"` for reading or `"w"` for writing.

We store the file object to a variable named `file`. 
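As a rough illustration of how the mode argument changes what a file object can do, the sketch below writes, appends to, and then reads a small scratch file. The temporary path and the file contents are made up for this example and are not part of the dataset.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# "w" opens for writing, creating the file or truncating an existing one.
file = open(path, "w", encoding="utf-8")
file.write("first line\n")
file.close()

# "a" opens for appending, so existing content is kept.
file = open(path, "a", encoding="utf-8")
file.write("second line\n")
file.close()

# "r" opens for reading only; it is also the default when no mode is given.
file = open(path, "r", encoding="utf-8")
lines = file.readlines()
file.close()

print(lines)  # ['first line\n', 'second line\n']
```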

In [1]:
import pandas as pd

In [2]:
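# "mbcs" is a Windows-only codec (the system ANSI code page), so this call fails on other platforms.
# It is the likely source of the garbled characters in the output below; encoding="utf-8" is the usual choice.
file = open("dataset/thriller.txt", "r", encoding="mbcs")
print("Name of the file:", file.name)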

Name of the file: dataset/thriller.txt
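The `encoding` argument tells `open()` how to turn the bytes on disk into characters. When it does not match the encoding the file was saved with, non-ASCII characters come out garbled. The sketch below reproduces the effect in memory with a made-up sample word, using `latin-1` as the mismatched codec purely for illustration.

```python
original = "café"

# Encode the text the way a UTF-8 file would store it on disk.
raw = original.encode("utf-8")
print(raw)  # b'caf\xc3\xa9'

# Decoding with the matching codec recovers the text.
decoded_ok = raw.decode("utf-8")

# Decoding with a different single-byte codec turns the two bytes of "é"
# into two unrelated characters.
decoded_wrong = raw.decode("latin-1")

print(decoded_ok)     # café
print(decoded_wrong)  # cafÃ©
```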


To read the file, we invoke `readlines()` on the file object, which returns a list with one string per line. Let's store this list in the `summary` variable and print it.

In [3]:
summary = file.readlines()
print("Read Line: %s" % summary)

Read Line: ["Thriller is the sixth studio album by the American singer Michael Jackson, released on November 30, 1982 by Epic Records. It is the follow-up to Jackson's critically and commercially successful fifth studio album Off the Wall (1979). Thriller explores genres similar to those of Off the Wall, including pop, post-disco, rock and funk. Recording sessions for Thriller took place from April to November 1982 at Westlake Recording Studios in Los Angeles, California with a production budget of 750,000(ð\x9d‘ˆð\x9d‘†\n", '\n', '1,842,155 in 2016 dollars[1]), assisted by producer Quincy Jones. Of the album\'s nine tracks, four were written by Jackson. Seven singles were released from the album, all of which reached the top 10 on the Billboard Hot 100 chart. Three of the singles had music videos released. "Baby Be Mine" and "The Lady in My Life" were the only tracks that were not released as singles from the album.\n', '\n', '[1] Federal Reserve Bank of Minneapolis Community Developm
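`readlines()` is one of several ways to get text out of a file object. The sketch below compares it with `read()` and with iterating over the file directly. To stay self-contained it uses `io.StringIO`, an in-memory object that behaves like a text file opened for reading, with made-up sample text.

```python
import io

# Made-up sample text with a blank line between two paragraphs.
text = "first paragraph\n\nsecond paragraph\n"

# readlines() returns a list of lines, each keeping its trailing newline.
as_lines = io.StringIO(text).readlines()

# read() returns the whole content as a single string.
as_string = io.StringIO(text).read()

# Iterating over the file yields one line at a time; rstrip removes the newline.
stripped = [line.rstrip("\n") for line in io.StringIO(text)]

print(as_lines)   # ['first paragraph\n', '\n', 'second paragraph\n']
print(stripped)   # ['first paragraph', '', 'second paragraph']
```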

It's always good practice to close a file after you're done reading it. We call the `close()` method to do that. 

In [4]:
file.close()
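A common alternative to calling `close()` by hand is the `with` statement, which closes the file automatically when the block ends, even if an error occurs inside it. This is a minimal sketch using a hypothetical scratch file rather than the dataset.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("line one\nline two\n")

# The file is open only inside the with block.
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()

print(lines)     # ['line one\n', 'line two\n']
print(f.closed)  # True
```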

<hr>

<a id="ref2"></a>
<h2 align=center>String Operations</h2>

Python provides many functions and methods for manipulating strings. We are going to apply some basic ones to the data that we read before. In this lesson, we go over the following: 
- len()
- upper(), lower()
- replace()
- count()
- split()
- sort()
- List slicing

<h3 style="font-size:120%">len()</h3>

The first function is ``len()``, which returns the total number of characters in the given string. Let's find out how many characters there are in the first line of the ``summary`` variable.

In [5]:
print(summary[0])
len(summary[0])              

Thriller is the sixth studio album by the American singer Michael Jackson, released on November 30, 1982 by Epic Records. It is the follow-up to Jackson's critically and commercially successful fifth studio album Off the Wall (1979). Thriller explores genres similar to those of Off the Wall, including pop, post-disco, rock and funk. Recording sessions for Thriller took place from April to November 1982 at Westlake Recording Studios in Los Angeles, California with a production budget of 750,000(ð‘ˆð‘†



508
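Note that `len()` counts every character, including the trailing newline that `readlines()` keeps at the end of each line. The sketch below shows this on a short sample string standing in for `summary[0]`.

```python
# Short sample line standing in for summary[0]; note the trailing newline.
line = "Thriller is the sixth studio album\n"

with_newline = len(line)
without_newline = len(line.rstrip("\n"))

print(with_newline)     # 35
print(without_newline)  # 34

# len() also works on lists, where it counts the elements.
print(len(["Thriller", "is", "the"]))  # 3
```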

<br>

<h3 style="font-size:120%">upper()</h3>

Sometimes we need the whole string in upper case. The string method ``upper()`` returns a copy of the string with every letter converted to upper case.

In [6]:
to_upper = summary[0].upper()
to_upper

"THRILLER IS THE SIXTH STUDIO ALBUM BY THE AMERICAN SINGER MICHAEL JACKSON, RELEASED ON NOVEMBER 30, 1982 BY EPIC RECORDS. IT IS THE FOLLOW-UP TO JACKSON'S CRITICALLY AND COMMERCIALLY SUCCESSFUL FIFTH STUDIO ALBUM OFF THE WALL (1979). THRILLER EXPLORES GENRES SIMILAR TO THOSE OF OFF THE WALL, INCLUDING POP, POST-DISCO, ROCK AND FUNK. RECORDING SESSIONS FOR THRILLER TOOK PLACE FROM APRIL TO NOVEMBER 1982 AT WESTLAKE RECORDING STUDIOS IN LOS ANGELES, CALIFORNIA WITH A PRODUCTION BUDGET OF 750,000(Ð\x9d‘ˆÐ\x9d‘†\n"

In the above code block, we converted the first line of the summary text to all uppercase.
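Strings in Python are immutable, so `upper()` does not change the string it is called on; it returns a new one. A small sketch with a sample string:

```python
title = "Off the Wall"

# upper() returns a new string; the original is left unchanged.
shouted = title.upper()

print(shouted)  # OFF THE WALL
print(title)    # Off the Wall
```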

<br>

<h3 style="font-size:120%">lower()</h3>

Similarly, the ``lower()`` method can be used to convert the whole string to lower case. Let's convert the same string that we converted to uppercase to lowercase.

In [8]:
to_lower = summary[0].lower()
to_lower

"thriller is the sixth studio album by the american singer michael jackson, released on november 30, 1982 by epic records. it is the follow-up to jackson's critically and commercially successful fifth studio album off the wall (1979). thriller explores genres similar to those of off the wall, including pop, post-disco, rock and funk. recording sessions for thriller took place from april to november 1982 at westlake recording studios in los angeles, california with a production budget of 750,000(ð\x9d‘ˆð\x9d‘†\n"

We can clearly see the difference between the outputs of the last two methods.
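One typical use of these methods is comparing strings without regard to case. The sketch below uses made-up sample strings and also shows `title()`, a related method that capitalizes the first letter of every word.

```python
a = "Michael Jackson"
b = "MICHAEL JACKSON"

# A plain comparison is case-sensitive.
exact_match = (a == b)

# Normalizing both sides with lower() makes the comparison case-insensitive.
loose_match = (a.lower() == b.lower())

print(exact_match)  # False
print(loose_match)  # True

# title() capitalizes each word instead of lowercasing everything.
print("off the wall".title())  # Off The Wall
```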

<h3 style="font-size:120%">replace()</h3>

What if we want to replace characters in a string?

This can be done with the ``replace()`` method, which takes two arguments: the substring to replace and the substring to put in its place. Every occurrence is replaced.

Let's replace the **white spaces** with **hyphens ("-")** in the first element of the ``summary`` variable. 

In [9]:
replace_chars = summary[0].replace(" ", "-")
replace_chars

"Thriller-is-the-sixth-studio-album-by-the-American-singer-Michael-Jackson,-released-on-November-30,-1982-by-Epic-Records.-It-is-the-follow-up-to-Jackson's-critically-and-commercially-successful-fifth-studio-album-Off-the-Wall-(1979).-Thriller-explores-genres-similar-to-those-of-Off-the-Wall,-including-pop,-post-disco,-rock-and-funk.-Recording-sessions-for-Thriller-took-place-from-April-to-November-1982-at-Westlake-Recording-Studios-in-Los-Angeles,-California-with-a-production-budget-of-750,000(ð\x9d‘ˆð\x9d‘†\n"

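`replace()` also accepts an optional third argument that limits how many occurrences are replaced, and replacing with an empty string deletes the matched text. A brief sketch on a sample string:

```python
text = "pop, post-disco, rock and funk"

# Replace every space with a hyphen.
all_hyphens = text.replace(" ", "-")

# The optional third argument limits the number of replacements.
first_only = text.replace(" ", "-", 1)

# Replacing with an empty string removes the matched characters.
no_commas = text.replace(",", "")

print(all_hyphens)  # pop,-post-disco,-rock-and-funk
print(first_only)   # pop,-post-disco, rock and funk
print(no_commas)    # pop post-disco rock and funk
```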
<h3 style="font-size:120%">count()</h3>

We can count occurrences of a substring in a string using `count()`. Let's find out how many times the word "Thriller" occurs in the first line of `summary`.

In [10]:
summary[0].count('Thriller')

3
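Two details of `count()` are worth keeping in mind: it is case-sensitive, and it matches substrings rather than whole words. The sketch below illustrates both on a made-up sentence, and shows one possible way to count whole words only.

```python
text = "Thriller is an album. thriller is also a genre. Thrillers sell well."

# Case-sensitive substring match: "Thriller" and the start of "Thrillers".
exact = text.count("Thriller")

# Lowercasing first makes the count case-insensitive.
any_case = text.lower().count("thriller")

# Counting whole words: split on white space and strip surrounding punctuation.
words = [word.strip(".,") for word in text.split()]
whole_word = words.count("Thriller")

print(exact)       # 2
print(any_case)    # 3
print(whole_word)  # 1
```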

<h3 style="font-size:120%">split()</h3>

Previously, we learned that we can read a file line by line using the ``readlines()`` method. But what if we want to split a string word by word?

This can be done using the ``split()`` method. Called without arguments, it splits the string on runs of white space.

In [11]:
word_list = summary[0].split()
word_list

['Thriller',
 'is',
 'the',
 'sixth',
 'studio',
 'album',
 'by',
 'the',
 'American',
 'singer',
 'Michael',
 'Jackson,',
 'released',
 'on',
 'November',
 '30,',
 '1982',
 'by',
 'Epic',
 'Records.',
 'It',
 'is',
 'the',
 'follow-up',
 'to',
 "Jackson's",
 'critically',
 'and',
 'commercially',
 'successful',
 'fifth',
 'studio',
 'album',
 'Off',
 'the',
 'Wall',
 '(1979).',
 'Thriller',
 'explores',
 'genres',
 'similar',
 'to',
 'those',
 'of',
 'Off',
 'the',
 'Wall,',
 'including',
 'pop,',
 'post-disco,',
 'rock',
 'and',
 'funk.',
 'Recording',
 'sessions',
 'for',
 'Thriller',
 'took',
 'place',
 'from',
 'April',
 'to',
 'November',
 '1982',
 'at',
 'Westlake',
 'Recording',
 'Studios',
 'in',
 'Los',
 'Angeles,',
 'California',
 'with',
 'a',
 'production',
 'budget',
 'of',
 '750,000(ð\x9d‘ˆð\x9d‘†']
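As the output shows, punctuation stays attached to the words. The sketch below, on a short sample string, shows `split()` with an explicit separator and a split limit, plus one possible way to strip punctuation from the ends of each word using `string.punctuation`.

```python
import string

line = "pop, post-disco, rock and funk."

# No argument: split on runs of white space.
by_space = line.split()

# An explicit separator splits only where that exact text occurs.
by_comma = line.split(", ")

# The second argument (maxsplit) limits the number of splits.
first_split = line.split(" ", 1)

# strip() removes punctuation from both ends of a word, not from the middle.
clean = [word.strip(string.punctuation) for word in by_space]

print(by_space)     # ['pop,', 'post-disco,', 'rock', 'and', 'funk.']
print(by_comma)     # ['pop', 'post-disco', 'rock and funk.']
print(first_split)  # ['pop,', 'post-disco, rock and funk.']
print(clean)        # ['pop', 'post-disco', 'rock', 'and', 'funk']
```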

<h3 style="font-size:120%">sort()</h3>

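Lists can be sorted in place with the ``sort()`` method. Let's use it to sort the elements of ``word_list`` in ascending order. Strings are compared character by character by Unicode code point, so entries starting with punctuation or digits come first, followed by capitalized words, then lowercase words.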

In [12]:
word_list.sort()
word_list

['(1979).',
 '1982',
 '1982',
 '30,',
 '750,000(ð\x9d‘ˆð\x9d‘†',
 'American',
 'Angeles,',
 'April',
 'California',
 'Epic',
 'It',
 "Jackson's",
 'Jackson,',
 'Los',
 'Michael',
 'November',
 'November',
 'Off',
 'Off',
 'Recording',
 'Recording',
 'Records.',
 'Studios',
 'Thriller',
 'Thriller',
 'Thriller',
 'Wall',
 'Wall,',
 'Westlake',
 'a',
 'album',
 'album',
 'and',
 'and',
 'at',
 'budget',
 'by',
 'by',
 'commercially',
 'critically',
 'explores',
 'fifth',
 'follow-up',
 'for',
 'from',
 'funk.',
 'genres',
 'in',
 'including',
 'is',
 'is',
 'of',
 'of',
 'on',
 'place',
 'pop,',
 'post-disco,',
 'production',
 'released',
 'rock',
 'sessions',
 'similar',
 'singer',
 'sixth',
 'studio',
 'studio',
 'successful',
 'the',
 'the',
 'the',
 'the',
 'the',
 'those',
 'to',
 'to',
 'to',
 'took',
 'with']
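The ordering above places every capitalized word before every lowercase one. If a case-insensitive order is wanted, a `key` function can be passed. The sketch below uses a small sample list and also contrasts `sort()`, which modifies the list in place and returns `None`, with `sorted()`, which returns a new list.

```python
words = ["the", "Wall", "album", "Off", "1982"]

# sorted() returns a new list and leaves the original untouched.
default_order = sorted(words)

# key=str.lower compares the lowercased form of each word.
ignore_case = sorted(words, key=str.lower)

# sort() reorders the list in place and returns None.
result = words.sort(reverse=True)

print(default_order)  # ['1982', 'Off', 'Wall', 'album', 'the']
print(ignore_case)    # ['1982', 'album', 'Off', 'the', 'Wall']
print(words)          # ['the', 'album', 'Wall', 'Off', '1982']
print(result)         # None
```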

<h3 style="font-size:120%">Slicing Lists</h3>

We can slice the word list to return a subsection of it. 

This is done with `word_list[start:end]`, where `start` is included, `end` is excluded, and positions are counted from 0.

In our example, we return the items at positions 8 to 29 of the sorted list. We store this new list in the variable `substring`.

In [13]:
substring = word_list[8:30]
substring

['California',
 'Epic',
 'It',
 "Jackson's",
 'Jackson,',
 'Los',
 'Michael',
 'November',
 'November',
 'Off',
 'Off',
 'Recording',
 'Recording',
 'Records.',
 'Studios',
 'Thriller',
 'Thriller',
 'Thriller',
 'Wall',
 'Wall,',
 'Westlake',
 'a']
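Slices also accept omitted bounds, negative indices, and a step, and the same syntax works on strings. A short sketch on a sample list:

```python
words = ["Thriller", "is", "the", "sixth", "studio", "album"]

# start is included, end is excluded.
middle = words[1:4]

# Omitting start or end means "from the beginning" or "to the end".
first_two = words[:2]

# Negative indices count from the end of the list.
last_two = words[-2:]

# A third value sets the step.
every_other = words[::2]

# Strings can be sliced the same way, character by character.
prefix = "Thriller"[0:4]

print(middle)       # ['is', 'the', 'sixth']
print(first_two)    # ['Thriller', 'is']
print(last_two)     # ['studio', 'album']
print(every_other)  # ['Thriller', 'the', 'studio']
print(prefix)       # Thri
```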

---