### Splitting

Some string methods convert a string into another type of data, such as a list, by breaking the string into pieces.

Let's work with the following example. We want to split the string into a list of substrings. Python provides two methods for this: .split() and .rsplit(). Both return a list. Both take sep, the separator at which the string is split, and maxsplit, the maximum number of splits to perform, so the resulting list has at most maxsplit + 1 elements.

In [1]:
my_string = "This string will be split"
my_string

'This string will be split'

In [5]:
## .split() method
my_string.split(sep=" ", maxsplit=2)

['This', 'string', 'will be split']

In [6]:
my_string.rsplit(sep=" ", maxsplit=2)

['This string will', 'be', 'split']

As we can see in the code, the difference is that .split() starts splitting from the left, while .rsplit() starts from the right of the string.
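
The direction matters most when only the first or the last piece is of interest. A minimal sketch, assuming a hypothetical file name that is not part of the original example:

```python
# Hypothetical file name, used only for illustration
filename = "archive.tar.gz"

# One split from the left separates the first piece from the rest
from_left = filename.split(".", maxsplit=1)

# One split from the right separates the last piece from the rest
from_right = filename.rsplit(".", maxsplit=1)

print(from_left)   # ['archive', 'tar.gz']
print(from_right)  # ['archive.tar', 'gz']
```

In both cases maxsplit=1 performs a single split, so each list has two elements.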

If maxsplit is not specified, both methods behave in the same way: they return as many substrings as possible. If you want to split on whitespace, you don't have to specify the sep argument.

In [7]:
## .split() method
my_string.split()

['This', 'string', 'will', 'be', 'split']

In [8]:
my_string.rsplit()

['This', 'string', 'will', 'be', 'split']
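
Omitting sep is not quite the same as passing a single space. The sketch below uses a made-up string with repeated spaces (an assumption for illustration) to show the difference:

```python
# Made-up string with runs of two and three spaces
spaced = "This  string   has extra spaces"

# No sep: any run of whitespace counts as one separator
default_split = spaced.split()

# sep=" ": every single space is a separator, so empty strings appear
explicit_split = spaced.split(" ")

print(default_split)   # ['This', 'string', 'has', 'extra', 'spaces']
print(explicit_split)  # ['This', '', 'string', '', '', 'has', 'extra', 'spaces']
```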

Consider the following string. If we print it out, we can see that it contains two lines. Why is that?

In [9]:
my_string = "This string will be split\nin two"
print(my_string)

This string will be split
in two


In [10]:
my_string2 = "This string will be split\rin two"
print(my_string2) 

This string will be splitin two


Escape sequences such as \n or \r indicate a line boundary. Sometimes, we want to split a string into lines. So in the case of our string, we want to split it at the \n.

### splitlines

For this purpose, Python has the .splitlines() method.

In [11]:
my_string = "This string will be split\nin two"
print(my_string)

This string will be split
in two


In [12]:
my_string.splitlines()

['This string will be split', 'in two']

As we can see in the code, the string is split at the \n sequence, returning a list of two elements.
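
The same method also handles other line boundaries. A small sketch, assuming a made-up string that mixes \r\n, \r and \n:

```python
# Made-up text mixing three kinds of line boundary
mixed = "first line\r\nsecond line\rthird line\nfourth line"

# All three boundaries are recognized and removed
lines = mixed.splitlines()

# keepends=True keeps the boundary characters on each element
lines_with_ends = mixed.splitlines(keepends=True)

# Unlike split("\n"), splitlines() adds no empty string for a trailing newline
trailing = "a\nb\n"
print(lines)
print(lines_with_ends)
print(trailing.split("\n"))   # ['a', 'b', '']
print(trailing.splitlines())  # ['a', 'b']
```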

### Joining
Some methods can paste or concatenate together the objects in a list or other iterable. This is the case for the .join() method. The syntax is simple.

- The method is called on the separating element.

- Inside the call, we specify the list or other iterable.

In [13]:
my_list = ["this", "would", "be", "a", "string"]
my_list

['this', 'would', 'be', 'a', 'string']

In [15]:
list_to_str = " ".join(my_list)
list_to_str

'this would be a string'

We can observe in the example that a space is specified as the separator and the data is a list. The result is a single string containing all the elements of the list separated by spaces.
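
The .join() method expects every element to be a string and works with any iterable, not only lists. A short sketch with made-up values:

```python
# Made-up values, used only for illustration
numbers = [2024, 5, 17]

# join() raises a TypeError on non-string elements, so convert them first
date_text = "-".join(str(n) for n in numbers)

# A tuple works as well, and the separator may be an empty string
letters = "".join(("a", "b", "c"))

print(date_text)  # 2024-5-17
print(letters)    # abc
```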

### Stripping characters

Lastly, we'll talk about methods that trim characters from a string. The .strip() method removes both leading and trailing characters. Inside the call, we can specify the characters to remove. If we don't, whitespace (including \n) is removed.

Let's say we have the following string. And we apply the dot strip method as shown. We get a string where both the leading space and the trailing escape sequence were removed.

In [17]:
my_string = " This string will be stripped\n"
print(my_string)

new_str = my_string.strip()
print(new_str)

 This string will be stripped

This string will be stripped
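
When an argument is passed, .strip() treats it as a set of characters to remove from both ends, not as a literal prefix or suffix. A minimal sketch with made-up strings:

```python
# Made-up strings, used only for illustration
price = "$$19.99$"
messy = "xyxHello worldyx"
dashed = "--a-b--"

# Every leading and trailing '$' is removed
print(price.strip("$"))    # 19.99

# Any mix of 'x' and 'y' is removed from both ends
print(messy.strip("xy"))   # Hello world

# Characters in the middle are never touched
print(dashed.strip("-"))   # a-b
```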


### Stripping characters: .rstrip() and .lstrip()

- We can apply the .rstrip() method, and it will return a string with the trailing \n removed.

- If we apply the .lstrip() method, we'll get a string with the leading whitespace eliminated.

In [19]:
my_string = " This string will be stripped\n"
print(my_string)

## applying .rstrip(), it'll remove trailing \n
print(my_string.rstrip())

## applying .lstrip(), it'll remove leading whitespace
print(my_string.lstrip())

 This string will be stripped

 This string will be stripped
This string will be stripped



### Exercise 1: Normalizing reviews

It's time to extract some important words present in your movie review dataset. First, you need to normalize them and then count their frequency. Part of the normalization involves converting all the words to lowercase, removing special characters, and extracting the root of a word so you count the variants as one.

So imagine you have the following reviews: 'The movie surprises me very much' and 'Marvel movies always surprise their audience'. If you count the word frequency, you will count 'surprises' one time and 'surprise' one time. However, the verb 'surprise' appears in both and its frequency should be two.

The text of one movie review has already been saved in the variable movie.

In [20]:
movie = '$I supposed that coming from MTV Films I should expect no less$'
print(movie)

$I supposed that coming from MTV Films I should expect no less$


- Convert the string in the variable movie to lowercase. Print the result.


- Remove the '$' that occurs at the start and at the end of the string contained in movie_lower. Print the results.


- Split the string contained in movie_no_sign into as many substrings as possible. Print the results.


- Get the root of the second word contained in movie_split.

In [21]:
# Convert the string in the variable movie to lowercase. Print the result.
movie_lower = movie.lower()
print(movie_lower)

$i supposed that coming from mtv films i should expect no less$


In [23]:
# Remove the '$' that occurs at the start and at the end of the string contained in movie_lower. Print the results.
movie_no_sign = movie_lower.strip("$")
print(movie_no_sign)

i supposed that coming from mtv films i should expect no less


In [24]:
# Split the string contained in movie_no_sign into as many substrings as possible. Print the results.
movie_split = movie_no_sign.split()
print(movie_split)

['i', 'supposed', 'that', 'coming', 'from', 'mtv', 'films', 'i', 'should', 'expect', 'no', 'less']


In [25]:
# Get the root of the second word contained in movie_split.
root_word = movie_split[1][:-1]
print(root_word)

suppose
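
The word-counting idea described at the start of this exercise can be sketched as follows. The helper name normalize and the '$'-wrapped sample reviews are assumptions made for illustration; Counter comes from the standard library collections module.

```python
from collections import Counter


def normalize(review):
    """Lowercase the review, strip '$' from both ends, split on whitespace."""
    return review.lower().strip("$").split()


# Sample reviews from the exercise text, wrapped in '$' for illustration
reviews = [
    "$The movie surprises me very much$",
    "$Marvel movies always surprise their audience$",
]

word_counts = Counter()
for review in reviews:
    word_counts.update(normalize(review))

# Without a root-extraction step the two variants are counted separately
print(word_counts["surprises"])  # 1
print(word_counts["surprise"])   # 1
```

Slicing off the last character, as in the solution above, only works for this particular word. It is not a general way to extract roots.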


### Exercise 2: Time to join!

The text of a movie review has already been saved in the variable movie.

In [27]:
movie = 'the film,however,is all good<\\i>'
print(movie)

the film,however,is all good<\i>


While normalizing your text, you noticed that one review had a particular structure. This review ends with the HTML tag <\\i> and it has a lot of commas in different places of the sentence. You decide to remove the tag from the end and use the strategy of splitting the string and joining it back again without the commas.

- Remove tag <\\i> from the end of the string. Print the results.


- Split the string contained in movie_tag using the commas as a separating element. Print the results.


- Join back together the list of substrings contained in movie_no_comma using a space as a join element. Print the results.

In [28]:
# Remove tag <\i> from the end of the string. Print the results.
# Note: .strip() treats its argument as a set of characters, not as a literal tag.
movie_tag = movie.strip("<\\i>")
print(movie_tag)

the film,however,is all good


In [29]:
# Split the string contained in movie_tag using the commas as a separating element. Print the results.
movie_no_comma = movie_tag.split(",")
print(movie_no_comma)

['the film', 'however', 'is all good']


In [31]:
## Join back together the list of substrings contained in movie_no_comma using a space as a join element. Print the results.
movie_join = " ".join(movie_no_comma)
print(movie_join)

the film however is all good
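
Because .strip() removes a set of characters rather than a literal tag, it can remove more than intended on other reviews. The sketch below shows one alternative, checking the ending explicitly and slicing it off. The second review is a made-up example.

```python
movie = "the film,however,is all good<\\i>"
tag = "<\\i>"

# Remove the tag only if the string really ends with it
if movie.endswith(tag):
    movie_tag = movie[:-len(tag)]
else:
    movie_tag = movie

# Split on commas and join back with spaces in one step
movie_join = " ".join(movie_tag.split(","))
print(movie_join)  # the film however is all good

# Made-up review: strip() also eats the leading 'i' and the final 'i' of 'kiwi'
other = "i liked it, kiwi<\\i>"
print(other.strip(tag))  # ' liked it, kiw'
```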


### Exercise 3: Split lines or split the line?

The text of the file has already been saved in the variable file.

In [32]:
file = 'mtv films election, a high school comedy, is a current example\nfrom there, director steven spielberg wastes no time, taking us into the water on a midnight swim'
print(file)

mtv films election, a high school comedy, is a current example
from there, director steven spielberg wastes no time, taking us into the water on a midnight swim


You are about to leave work when a colleague asks you to use your string manipulation skills to help him. You need to read strings from a file in a way that if the file contains strings on different lines, they are stored as separate elements. He also wants you to break the strings into pieces if you see that they contain commas.

- Split the string file into many substrings at line boundaries. Print out the resulting variable file_split.


- Split the strings into many substrings using commas as a separator element.


- Split the strings into many substrings using a space as a separator element.

In [33]:
## Split the string file into many substrings at line boundaries. Print out the resulting variable file_split.
file_split = file.splitlines()
print(file_split)

['mtv films election, a high school comedy, is a current example', 'from there, director steven spielberg wastes no time, taking us into the water on a midnight swim']


In [36]:
# Split the strings into many substrings using commas as a separator element.
for substrings in file_split:
    print(substrings)
    substrings_split = substrings.split(',')
    print(substrings_split)

mtv films election, a high school comedy, is a current example
['mtv films election', ' a high school comedy', ' is a current example']
from there, director steven spielberg wastes no time, taking us into the water on a midnight swim
['from there', ' director steven spielberg wastes no time', ' taking us into the water on a midnight swim']


In [37]:
# Split the strings into many substrings using a space as a separator element.
for substrings in file_split:
    print(substrings)
    substrings_split = substrings.split(' ')
    print(substrings_split)

mtv films election, a high school comedy, is a current example
['mtv', 'films', 'election,', 'a', 'high', 'school', 'comedy,', 'is', 'a', 'current', 'example']
from there, director steven spielberg wastes no time, taking us into the water on a midnight swim
['from', 'there,', 'director', 'steven', 'spielberg', 'wastes', 'no', 'time,', 'taking', 'us', 'into', 'the', 'water', 'on', 'a', 'midnight', 'swim']
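
The comma split above leaves a leading space on most pieces. One possible way to tidy this up, sketched here with a nested list comprehension (the variable name pieces is an assumption), is to strip each piece after splitting:

```python
file = (
    "mtv films election, a high school comedy, is a current example\n"
    "from there, director steven spielberg wastes no time, "
    "taking us into the water on a midnight swim"
)

# One inner list per line; each comma-separated piece has its whitespace stripped
pieces = [
    [part.strip() for part in line.split(",")]
    for line in file.splitlines()
]

for line_pieces in pieces:
    print(line_pieces)
```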


In [38]:
import pandas as pd

In [39]:
df = pd.read_csv("short_movies.csv")
df

Unnamed: 0,id,tag,html,sent id,text,target
0,0,cv000,29590,0,films adapted from comic books have had plenty...,pos
1,0,cv000,29590,1,"for starters , it was created by alan moore ( ...",pos
2,0,cv000,29590,2,to say moore and campbell thoroughly researche...,pos
3,0,cv000,29590,3,"the book ( or "" graphic novel , "" if you will ...",pos
4,0,cv000,29590,4,"in other words , don't dismiss this film becau...",pos
...,...,...,...,...,...,...
19995,6,cv615,14182,20,"wilder's script is stunning , as he carefully ...",pos
19996,6,cv615,14182,21,wilder manages to create scenes of utter hyste...,pos
19997,6,cv615,14182,22,wilder's is dead wrong when he says nobody's p...,pos
19998,6,cv615,14182,23,nobody's perfect but billy wilder .,pos


In [43]:
print(df["text"].iloc[137])

heck , jackie doesn't even have enough money for a haircut , looks like , much less a personal hairstylist .


In [44]:
print(df["text"].iloc[138])

in condor , chan plays the same character he's always played , himself , a mixture of bruce lee and tim allen , a master of both kung-fu and slapstick-fu .


In [45]:
print(df["text"].iloc[200])

it's clear that he's passionate about his beliefs , and that he's not just a punk looking for an excuse to beat people up .


In [46]:
print(df["text"].iloc[201])

I believe you I always said that the actor actor actor is amazing in every movie he has played


In [47]:
print(df["text"].iloc[202])

it's astonishing how frightening the actor actor norton looks with a shaved head and a swastika on his chest.


### Finding substrings

Python has several built-in methods that will help you search a target string for a specified substring. 

The first method is .find(). As you can see in the slide, it takes the desired substring as a mandatory argument. You can also specify two optional arguments: an inclusive starting position and an exclusive ending position.

<img src="f.jpg" style="max-width:500px">

In [48]:
my_string = "Where's Waldo?"
print(my_string)

Where's Waldo?


In [50]:
my_string.find("Waldo")

8

In [51]:
my_string.find("Wenda")

-1

In the example code, we search for Waldo in the string Where's Waldo?. The .find() method returns the lowest index in the string where it can find the substring, in this case, eight.

If we search for Wenda, the substring is not found and the method returns minus one.

Now, we check if we can find Waldo between characters number zero and five. In the code, we specify the starting position, zero, and the ending position, six, because this position is not inclusive. 

In [52]:
my_string.find("Waldo", 0, 6)

-1

The dot find method will not find the substring and returns minus one as we see in the output.
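
Since .find() returns only the lowest index, the starting position can be used to look for later occurrences. A minimal sketch, where the helper name find_all is hypothetical:

```python
def find_all(text, sub):
    """Return the start index of every non-overlapping occurrence of sub."""
    if not sub:
        raise ValueError("sub must not be empty")
    positions = []
    start = 0
    while True:
        idx = text.find(sub, start)
        if idx == -1:  # no further occurrence
            break
        positions.append(idx)
        start = idx + len(sub)  # continue searching after this match
    return positions


sentence = "The red house is between the blue house and the old house"
positions = find_all(sentence, "house")
print(positions)
print(find_all("Where's Waldo?", "Waldo"))  # [8]
```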

### Index method

The .index() method is similar to .find(). In the slide, we see that it takes the desired substring as a mandatory argument. It can take optional starting and ending positions as well.

<img src="i-1.jpg" style="max-width:500px">

In [53]:
my_string.index("Waldo")

8

In [54]:
my_string.index("Wenda")

ValueError: substring not found

In the example, we search again for Waldo using .index(). We get eight again. When we look for a substring that is not there, the behavior differs: .index() raises a ValueError exception, as we can see in the output.

We can handle this using a try-except block. In the slide, you can observe the syntax. The try part runs the given code. If a ValueError is raised, the except part is executed, producing the following output.

In [55]:
my_string = "Where's Waldo?"

try:
    my_string.index("Wenda")
except ValueError:
    print("Not found")

Not found
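
When only a yes or no answer is needed, the in operator avoids both the -1 check and the exception. The sketch below also wraps .index() in a hypothetical helper, safe_index, that returns None instead of raising:

```python
my_string = "Where's Waldo?"

# Membership test: no index is returned, only True or False
print("Waldo" in my_string)  # True
print("Wenda" in my_string)  # False


def safe_index(text, sub):
    """Return the lowest index of sub in text, or None if it is absent."""
    try:
        return text.index(sub)
    except ValueError:
        return None


print(safe_index(my_string, "Waldo"))  # 8
print(safe_index(my_string, "Wenda"))  # None
```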


### Counting occurrences

The .count() method searches for a specified substring in the target string. It returns the number of non-overlapping occurrences. In simple words, how many times the substring is present in the string. The syntax of .count() is very similar to that of the other methods, as we can observe.


<img src="c.jpg" style="max-width:500px">

In [56]:
my_string = "How many fruits do you have in your fruit basket?"
my_string.count("fruit")

2

In the example, we use the .count() method to get how many times fruit appears. In the output, we see that it is two.

Then, we limit the search for fruit to the characters between positions zero and fifteen of the string, as we can observe in the code. The method returns 1.

Remember that the starting position is inclusive, but the ending position is not.

In [57]:
my_string.count("fruit", 0, 16)

1
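
What "non-overlapping" means is easier to see on short, made-up strings:

```python
# 'aa' could start at index 0, 1 or 2, but matches may not share characters
print("aaaa".count("aa"))     # 2

# 'ana' starts at index 1 and at index 3, but the two matches overlap
print("banana".count("ana"))  # 1
```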

### Replacing substrings

Sometimes you will want to replace occurrences of a substring with a new substring. In this case, Python provides us with the .replace() method. As we see in the slide, it takes three arguments: the substring to replace, the new substring to put in its place, and an optional number that indicates how many occurrences to replace.

<img src="r.jpg" style="max-width:500px">

In the example code, we replace the substring house with car. The method will return a copy with all house substrings replaced. 

In [58]:
my_string = "The red house is between the blue house and the old house"
print(my_string.replace("house", "car"))

The red car is between the blue car and the old car


In the next example, we specify that we only want 2 of the occurrences to be replaced.

In [59]:
print(my_string.replace("house", "car", 2))

The red car is between the blue car and the old house


We see in the output that the method returns a copy of the string with the first two occurrences of house replaced by car.
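
Because strings are immutable, .replace() never changes the original string. The result has to be assigned to a name to be kept. A brief sketch (the variable names are chosen for illustration):

```python
original = "The red house is between the blue house and the old house"

# Replace only the first occurrence and keep the result under a new name
changed = original.replace("house", "car", 1)

print(original)  # still contains three times 'house'
print(changed)   # The red car is between the blue house and the old house

# If the substring is absent, the returned string equals the original
print(original.replace("boat", "car") == original)  # True
```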

### Exercise 4: Finding a substring

The text of three movie reviews has already been saved in the variable movies.

In [61]:
movies = [df["text"].iloc[200],df["text"].iloc[201], df["text"].iloc[202]]
print(movies)

["it's clear that he's passionate about his beliefs , and that he's not just a punk looking for an excuse to beat people up .", 'I believe you I always said that the actor actor actor is amazing in every movie he has played', "it's astonishing how frightening the actor actor norton looks with a shaved head and a swastika on his chest."]


It's a new day at work and you need to continue cleaning your dataset for the movie prediction project. While exploring the dataset, you notice a strange pattern: there are some repeated, consecutive words occurring between the character at position 37 and the character at position 41. 

You decide to write a function to find out which movie reviews show this peculiarity, remembering that the ending position you specify is not inclusive. If you detect the word, you also want to change the string by replacing it with only one instance of the word.

- Find if the substring 'actor' occurs between the characters with index 37 and 41 inclusive. If it is not detected, print the statement 'Word not found'.


- Replace 'actor actor' with the substring 'actor' if 'actor' is repeated only twice.


- Replace 'actor actor actor' with the substring 'actor' if 'actor' is repeated three times.

In [62]:
for movie in movies:
    if movie.find("actor", 37,42) == -1:
        print("Word not found")
        
    elif movie.count("actor") == 2:
        print(movie.replace("actor actor", "actor"))
        
    elif movie.count("actor") > 2:
        print(movie.replace("actor actor actor", "actor"))

Word not found
I believe you I always said that the actor is amazing in every movie he has played
it's astonishing how frightening the actor norton looks with a shaved head and a swastika on his chest.
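
The two replacement branches can also be handled by one loop that keeps collapsing a doubled word until none is left. This is only a sketch. The helper name collapse_repeats is hypothetical, and it matches plain substrings, so a text such as 'factor actor' would also be changed.

```python
def collapse_repeats(text, word):
    """Collapse consecutive repetitions of word (separated by one space)."""
    doubled = word + " " + word
    while doubled in text:
        text = text.replace(doubled, word)
    return text


three = "I believe you I always said that the actor actor actor is amazing in every movie he has played"
two = "it's astonishing how frightening the actor actor norton looks with a shaved head and a swastika on his chest."

print(collapse_repeats(three, "actor"))
print(collapse_repeats(two, "actor"))
```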


### Exercise 5: Where's the word?

The text of two movie reviews has already been saved in the variable movies2.

In [63]:
movies2 = [df["text"].iloc[137],df["text"].iloc[138]]
print(movies2)

["heck , jackie doesn't even have enough money for a haircut , looks like , much less a personal hairstylist .", "in condor , chan plays the same character he's always played , himself , a mixture of bruce lee and tim allen , a master of both kung-fu and slapstick-fu ."]


Before finishing cleaning your dataset, you want to check if a specific word occurs in the reviews. You noticed earlier a specific pattern in the strings. Now, you want to check if a word is present between the characters with index 12 and 50, remembering that the ending position is exclusive, and print out the lowest index where this word occurs. There are two methods to handle this situation. You want to see which one works best.

- Find the index where 'money' occurs between characters with index 12 and 50. If not found, the method should return -1.

- Find the index where 'money' occurs between characters with index 12 and 50. If not found, it should raise an error.

In [66]:
## Find the index where money occurs between characters with index 12 and 50. If not found, the method should return -1.
for review in movies2:
    print(review.find("money", 12, 51))

39
-1


In [67]:
# Find the index where 'money' occurs between characters with index 12 and 50. If not found, it should raise an error.
for review in movies2:
    try:
        print(review.index("money", 12, 51))
    except ValueError:
        print("substring not found")

39
substring not found


### Exercise 6: Replacing negations

In order to keep working with your prediction project, your next task is to figure out how to handle negations that occur in your dataset. Some algorithms for prediction do not work well with negations, so a good way to handle this is to remove either `not` or `n't`, and to replace the next word with its antonym.

Let's imagine that you have the string: `The movie isn't good`. You will need to remove `n't` and replace `good` with `bad`. This way, your string ends up being `The movie is bad`. You notice that in the first column of the dataset, you have a string that uses the word `isn't` followed by `important`.

The text of this column has already been saved in the variable `movies3` so you start working with it.

In [68]:
movies3 = "the rest of the story isn't important because all it does is serve as a mere backdrop for the two stars to share the screen ."
print(movies3)

the rest of the story isn't important because all it does is serve as a mere backdrop for the two stars to share the screen .


- Replace the substring 'isn't' with the word 'is'.


- Replace the substring 'important' with the word 'insignificant'.


- Print out the result contained in the variable movies_antonym.

In [69]:
## Replace the substring 'isn't' with the word 'is'.
movies_no_negation = movies3.replace("isn't", "is")
print(movies_no_negation)

the rest of the story is important because all it does is serve as a mere backdrop for the two stars to share the screen .


In [71]:
## Replace the substring 'important' with the word 'insignificant'.
movies_antonym = movies_no_negation.replace('important', 'insignificant')
print(movies_antonym)

the rest of the story is insignificant because all it does is serve as a mere backdrop for the two stars to share the screen .
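
With more than a couple of substitutions, the pairs can be kept in a dictionary and applied in a loop. A minimal sketch, where the dictionary name replacements is an assumption and the pairs are the two from this exercise:

```python
movies3 = "the rest of the story isn't important because all it does is serve as a mere backdrop for the two stars to share the screen ."

# Pairs applied in insertion order: first drop the negation, then swap in the antonym
replacements = {"isn't": "is", "important": "insignificant"}

movies_antonym = movies3
for old, new in replacements.items():
    movies_antonym = movies_antonym.replace(old, new)

print(movies_antonym)
```

Plain .replace() calls match substrings anywhere in the text, so this approach is only safe when the keys cannot appear inside other words.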
