
# String Manipulation

## 1. String Object Methods
As an example, a comma-separated string can be broken into pieces with split.

In [None]:
val = 'a,b,  lerning'

In [None]:
# split with no argument splits on runs of whitespace, not on commas
val.split()

['a,b,', 'lerning']

In [None]:
val.split(',')

['a', 'b', '  lerning']

split is often combined with strip to trim whitespace (including line breaks):

In [None]:
pieces = [x.strip() for x in val.split(',')]

In [None]:
pieces

['a', 'b', 'lerning']

These substrings could be concatenated together with a two-colon delimiter using addition.

In [None]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::lerning'

But this isn’t a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::'.

In [None]:
'::'.join(pieces)

'a::b::lerning'
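As a minimal sketch of why join generalizes better than addition, the helper below works for any number of pieces without unpacking them into named variables. The function name `rejoin` and the four-field input are hypothetical, introduced only for illustration.

```python
def rejoin(text, sep='::'):
    """Split on commas, trim each piece, and join with sep (hypothetical helper)."""
    pieces = [x.strip() for x in text.split(',')]
    return sep.join(pieces)

print(rejoin('a,b,  lerning'))           # the notebook's three-field string
print(rejoin('one , two,three , four'))  # illustrative four-field input
```

The addition-based version would need a different expression for each field count, whereas join accepts any list or tuple of strings.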

Other methods are concerned with locating substrings. Using Python’s in keyword is the best way to detect a substring, though index and find can also be used:

In [None]:
'lerning' in val

True

In [None]:
val.index(',')

1

In [None]:
val.find(':')

-1

Note that the difference between find and index is that index raises an exception if the string isn’t found, whereas find returns -1.
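A short sketch of that difference, reusing the notebook's `val` string: find signals a missing substring with a return value, while index has to be wrapped in try/except. The variable names `find_result` and `index_error` are assumptions made for this example.

```python
val = 'a,b,  lerning'

# find reports a missing substring through its return value
find_result = val.find(':')
print(find_result)

# index raises ValueError for the same missing substring
index_error = None
try:
    val.index(':')
except ValueError as exc:
    index_error = exc
print('index raised:', index_error)
```

Which one to use depends on whether a missing substring is an expected case (find) or a genuine error (index).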

Replace will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string.

In [None]:
val.replace(',', '::')

'a::b::  lerning'

In [None]:
val.replace(',', '')

'ab  lerning'

## 2. Regular Expressions
Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:

In [None]:
import re
text = "foo    bar\t baz  \tqux"

In [None]:
text

'foo    bar\t baz  \tqux'

In [None]:
# Use a raw string so that \s is not treated as a string escape sequence
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object.

In [None]:
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']
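Compiling pays off mainly when the same pattern is applied to many strings. The sketch below reuses one compiled object across a small list; the second string and the names `lines` and `split_lines` are made up for illustration.

```python
import re

# Compile once, then reuse the object for every string
regex = re.compile(r'\s+')

lines = ["foo    bar\t baz  \tqux",  # the notebook's text
         "one\ttwo   three"]         # illustrative extra line
split_lines = [regex.split(line) for line in lines]

for parts in split_lines:
    print(parts)
```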

Let’s consider a block of text and a regular expression capable of identifying most email addresses:

In [None]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com Ryan ryan@yahoo.com """
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [None]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
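The pattern covers most addresses rather than all of them, because the {2,4} quantifier limits the final domain label to four letters. As a small illustration with a made-up address (not from the notebook), findall returns only the portion that fits the pattern, while fullmatch rejects the whole string:

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# Hypothetical address whose top-level domain has six letters
address = 'alice@example.museum'

partial = regex.findall(address)   # matches only a truncated address
whole = regex.fullmatch(address)   # None, since the full string does not fit

print(partial)
print(whole)
```

When validating a single address rather than scanning text, fullmatch is the safer check under these assumptions.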

## 3. Vectorized String Functions in pandas
A column containing strings will sometimes have missing data, which complicates string processing:

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [None]:
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [None]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but this fails on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains.

In [None]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
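To make the contrast with data.map concrete, here is a minimal sketch that rebuilds the same Series so that it runs on its own. It assumes object dtype, as in the output shown above; the names `map_failed`, `has_gmail`, and `has_gmail_filled` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Same values as the notebook's Series, with object dtype stated explicitly
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan}, dtype=object)

# A plain Python substring test fails when it reaches the float NaN
try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True
print('map failed:', map_failed)

# The str accessor leaves the missing value as missing
has_gmail = data.str.contains('gmail')

# na=False treats missing values as non-matches, giving a clean boolean result
has_gmail_filled = data.str.contains('gmail', na=False)
print(has_gmail_filled)
```

The na=False variant is convenient when the result will be used as a boolean mask for filtering.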

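There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute. Regular expressions can be used too, along with re options such as IGNORECASE; here, str.match checks whether each value matches the pattern from the start of the string: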

In [None]:
matches = data.str.match(pattern, flags=re.IGNORECASE)

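In current versions of pandas, str.match returns a boolean for each value rather than a list. Methods such as str.findall do produce embedded lists, and to access their elements we can pass an index to either str.get or the str attribute.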
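A minimal sketch of that retrieval, assuming the same data and pattern as earlier (rebuilt here so the block runs on its own). It uses str.findall to produce a list of matches per row, then takes the first element with both str.get and str indexing; the variable names are assumptions for this example.

```python
import re

import numpy as np
import pandas as pd

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan}, dtype=object)

# Each non-missing row becomes a list of all matches in that string
found = data.str.findall(pattern, flags=re.IGNORECASE)

# Two equivalent ways to pull the first element out of each list
first_via_get = found.str.get(0)
first_via_index = found.str[0]

print(found)
print(first_via_get)
```

Both forms leave the missing entry for 'Wes' as NaN instead of raising.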

In [41]:
# Load the tweet dataset (the /content path assumes the file was uploaded to the Colab session)
df1 = pd.read_csv('/content/twitter_dataset.csv')

In [43]:
df1

|      | Tweet_ID | Username | Text | Retweets | Likes | Timestamp |
|------|----------|----------|------|----------|-------|-----------|
| 0 | 1 | julie81 | Party least receive say or single. Prevent pre... | 2 | 25 | 2023-01-30 11:00:51 |
| 1 | 2 | richardhester | Hotel still Congress may member staff. Media d... | 35 | 29 | 2023-01-02 22:45:58 |
| 2 | 3 | williamsjoseph | Nice be her debate industry that year. Film wh... | 51 | 25 | 2023-01-18 11:25:19 |
| 3 | 4 | danielsmary | Laugh explain situation career occur serious. ... | 37 | 18 | 2023-04-10 22:06:29 |
| 4 | 5 | carlwarren | Involve sense former often approach government... | 27 | 80 | 2023-01-24 07:12:21 |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | 9996 | ntate | Agree reflect military box ability ever hold. ... | 81 | 86 | 2023-01-15 11:46:20 |
| 9996 | 9997 | garrisonjoshua | Born which push still. Degree sometimes contro... | 73 | 100 | 2023-05-06 00:46:54 |
| 9997 | 9998 | adriennejackson | You day agent likely region. Teacher data mess... | 10 | 62 | 2023-02-27 14:55:08 |
| 9998 | 9999 | kcarlson | Guess without successful save. Particular natu... | 21 | 60 | 2023-01-09 16:09:35 |


In [49]:
df1['Username']

0               julie81
1         richardhester
2        williamsjoseph
3           danielsmary
4            carlwarren
             ...       
9995              ntate
9996     garrisonjoshua
9997    adriennejackson
9998           kcarlson
9999         vdickerson
Name: Username, Length: 10000, dtype: object

In [51]:
df1['Retweets']

0        2
1       35
2       51
3       37
4       27
        ..
9995    81
9996    73
9997    10
9998    21
9999    65
Name: Retweets, Length: 10000, dtype: int64

In [56]:
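# Pass dtype explicitly: a bare pd.Series() warns in older pandas and defaults to object dtype in newer versions
df1 = pd.Series(dtype='float64')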


In [57]:
df1

Series([], dtype: float64)