# Data Cleaning and Preprocessing

**Cleaning** and **preprocessing** datasets is often said to consume around **80% of your time** as a data scientist.

Data preparation includes:
- Data loading
- Data cleaning
- Data transforming
- Data rearranging

## 1) Handling Missing Data

- It is important to **handle missing data** to limit its **side effects** on the **results** of data analysis.
- In **pandas**, **missing data** is represented by **NaN**, which is short for **Not a Number**.

In [1]:
import numpy as np
import pandas as pd

In [6]:
s = pd.Series([23, 54, np.nan, None])
s

0    23.0
1    54.0
2     NaN
3     NaN
dtype: float64

In [7]:
s.isnull()

0    False
1    False
2     True
3     True
dtype: bool

In [8]:
s.isna()

0    False
1    False
2     True
3     True
dtype: bool

In [9]:
s = pd.Series(['green','black', 'white', None ,'red'])
s

0    green
1    black
2    white
3     None
4      red
dtype: object

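Because this Series holds **strings**, its dtype is **object** and **None** is stored as-is rather than converted to **NaN**. Both **isnull()** and **isna()** still report it as missing.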

In [10]:
s.isnull()

0    False
1    False
2    False
3     True
4    False
dtype: bool

In [11]:
s.isna()

0    False
1    False
2    False
3     True
4    False
dtype: bool

## 2) Filtering Out Missing Data

There are **two ways** to **filter out missing data** in **pandas**:
- Using the method **dropna()**
- Using the method **notnull()** together with **Boolean indexing**

In [63]:
s = pd.Series([23,54, np.nan, None, 34, 87])
s

0    23.0
1    54.0
2     NaN
3     NaN
4    34.0
5    87.0
dtype: float64

In [13]:
s.dropna()

0    23.0
1    54.0
4    34.0
5    87.0
dtype: float64

In [14]:
s.notnull()

0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool

In [15]:
s[s.notnull()]

0    23.0
1    54.0
4    34.0
5    87.0
dtype: float64

Notice that the **original series** has **not been changed**.

In [16]:
s

0    23.0
1    54.0
2     NaN
3     NaN
4    34.0
5    87.0
dtype: float64

To make the **changes permanent**, we use the argument **inplace=True**:

In [17]:
s.dropna(inplace=True)
s

0    23.0
1    54.0
4    34.0
5    87.0
dtype: float64

In [60]:
df = pd.DataFrame([[1,None,3, 4,5], [6,None, 8,9, 10], [11, 12,13,14,15]])
df

Unnamed: 0,0,1,2,3,4
0,1,,3,4,5
1,6,,8,9,10
2,11,12.0,13,14,15


When **dropna()** is applied to a **dataframe**, by default every **row** that contains a missing value is **deleted**.

In [54]:
df.isna()

Unnamed: 0,0,1,2,3,4
0,False,True,False,False,False
1,False,True,False,False,False
2,False,False,False,False,False


In [19]:
df.dropna()

Unnamed: 0,0,1,2,3,4
2,11,12.0,13,14,15


To **delete** every **column** that contains a **missing value**, we use the argument **axis=1**:

In [20]:
df.dropna(axis=1)

Unnamed: 0,0,2,3,4
0,1,3,4,5
1,6,8,9,10
2,11,13,14,15


In [26]:
df = pd.DataFrame([[6,None, 8,9, 10], [None,None,None, None,None], [11,12,13,14,15], [16,17,18,19,20]])
df

Unnamed: 0,0,1,2,3,4
0,6.0,,8.0,9.0,10.0
1,,,,,
2,11.0,12.0,13.0,14.0,15.0
3,16.0,17.0,18.0,19.0,20.0


We can instruct **pandas** to **delete only the rows** or **columns** whose values are **all missing**, using the argument **how='all'**:

In [27]:
df.dropna(how='all')

Unnamed: 0,0,1,2,3,4
0,6.0,,8.0,9.0,10.0
2,11.0,12.0,13.0,14.0,15.0
3,16.0,17.0,18.0,19.0,20.0


In [28]:
df = pd.DataFrame([[6,None, np.nan,9, 10],[1,None,2, 3,4], [11, None,13,14,15], [16,np.nan,18,19,20]])
df

Unnamed: 0,0,1,2,3,4
0,6,,,9,10
1,1,,2.0,3,4
2,11,,13.0,14,15
3,16,,18.0,19,20


In [29]:
df.dropna(axis=1, how='all')

Unnamed: 0,0,2,3,4
0,6,,9,10
1,1,2.0,3,4
2,11,13.0,14,15
3,16,18.0,19,20


We can also instruct **pandas** to **keep only the rows** that have at least a **certain number of non-missing values**, using the argument **thresh**:

In [56]:
df = pd.DataFrame([[6,None, np.nan,9, 10],[1,2,None, np.nan,np.nan], [11, None,13,14,15], [np.nan,16,18,19,20]])
df

Unnamed: 0,0,1,2,3,4
0,6.0,,,9.0,10.0
1,1.0,2.0,,,
2,11.0,,13.0,14.0,15.0
3,,16.0,18.0,19.0,20.0


In [57]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2,3,4
0,6.0,,,9.0,10.0
2,11.0,,13.0,14.0,15.0
3,,16.0,18.0,19.0,20.0


Any row with fewer than 3 non-missing values was deleted.
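
To see what `thresh` is counting, the following sketch rebuilds the same frame under the assumed name `df_thresh` and counts the non-missing values per row by hand. It is only an illustration of the rule, not how pandas implements it internally:

```python
import numpy as np
import pandas as pd

# Same values as the frame above, under a different (assumed) variable name.
df_thresh = pd.DataFrame([
    [6, None, np.nan, 9, 10],
    [1, 2, None, np.nan, np.nan],
    [11, None, 13, 14, 15],
    [np.nan, 16, 18, 19, 20],
])

# Number of non-missing values in each row.
valid_per_row = df_thresh.notna().sum(axis=1)

# Keeping rows with at least 3 valid values mirrors dropna(thresh=3).
kept_manually = df_thresh[valid_per_row >= 3]
kept_by_dropna = df_thresh.dropna(thresh=3)

print(valid_per_row.tolist())                 # [3, 2, 4, 4]
print(kept_manually.equals(kept_by_dropna))   # True
```

Row 1 has only two valid values, so it is the one that gets dropped.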

In [58]:
df.dropna(axis=1, thresh=3)

Unnamed: 0,0,3,4
0,6.0,9.0,10.0
1,1.0,,
2,11.0,14.0,15.0
3,,19.0,20.0


## 3) Filling in Missing Data

- **Instead** of **deleting** the missing values, we can **fill in** the missing data.
- By **deleting** rows or columns with **missing values**, you may be **discarding** a large amount of **valuable collected data**.
- Normally we **fill** the missing data with **neutral values** that will **not skew or bias our data**.
- You can **fill** the missing data using the pandas method **fillna()**.

In [62]:
df = pd.read_csv('data/temperature.csv', index_col = 'time')
df

Unnamed: 0_level_0,day1,day2,day3,day4
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,21.0,26.0,19,30.0
9,23.0,28.0,21,
10,25.0,30.0,23,34.0
11,,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


In [66]:
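s = pd.Series([23, 54, np.nan, None, 34, 87])
s.fillna(11)  # fillna() requires a fill value; here every NaN becomes 11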

0    23.0
1    54.0
2    11.0
3    11.0
4    34.0
5    87.0
dtype: float64

In [40]:
df.fillna(20)

Unnamed: 0_level_0,day1,day2,day3,day4
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,21.0,26.0,19,30.0
9,23.0,28.0,21,20.0
10,25.0,30.0,23,34.0
11,20.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,20.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


In [41]:
df.fillna({'day1':20, 'day2':25, 'day4':30})

Unnamed: 0_level_0,day1,day2,day3,day4
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,21.0,26.0,19,30.0
9,23.0,28.0,21,30.0
10,25.0,30.0,23,34.0
11,20.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,25.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


We can also use the argument **method='ffill'** to **fill** each **missing value** with the value that **precedes** it. Recent pandas releases deprecate this argument in favor of the equivalent **df.ffill()**:

In [43]:
df.fillna(method='ffill')

Unnamed: 0_level_0,day1,day2,day3,day4
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,21.0,26.0,19,30.0
9,23.0,28.0,21,30.0
10,25.0,30.0,23,34.0
11,25.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,34.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


Similarly, the argument **method='bfill'** **fills** each **missing value** with the value that **comes after** it (equivalent to **df.bfill()**):

In [44]:
df.fillna(method='bfill')

Unnamed: 0_level_0,day1,day2,day3,day4
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,21.0,26.0,19,30.0
9,23.0,28.0,21,34.0
10,25.0,30.0,23,34.0
11,27.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,31.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0
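
Because `fillna(method=...)` triggers a deprecation warning in recent pandas releases, the same two fills can be written with the dedicated `ffill()` and `bfill()` methods. A minimal sketch on a hypothetical Series named `temps` (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with two gaps (illustration only).
temps = pd.Series([21.0, np.nan, 25.0, np.nan, 27.0], index=[8, 9, 10, 11, 12])

forward = temps.ffill()   # each gap takes the previous valid value
backward = temps.bfill()  # each gap takes the next valid value

print(forward.tolist())   # [21.0, 21.0, 25.0, 25.0, 27.0]
print(backward.tolist())  # [21.0, 25.0, 25.0, 27.0, 27.0]
```

Note that a gap in the first position cannot be forward-filled, and a gap in the last position cannot be backward-filled; those values stay `NaN`.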


We can also **fill** the **missing values** with the **mean**. The mean is calculated separately for **each column**:

In [46]:
df.fillna(df.mean())

Unnamed: 0_level_0,day1,day2,day3,day4
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,21.0,26.0,19,30.0
9,23.0,28.0,21,34.272727
10,25.0,30.0,23,34.0
11,24.416667,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,29.5,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0
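
The column-wise behavior is easier to verify on a frame small enough to check by hand. The sketch below assumes a hypothetical two-column frame named `small`; the values are invented so that the means are round numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column (illustration only).
small = pd.DataFrame({
    "day1": [20.0, np.nan, 24.0],
    "day2": [30.0, 32.0, np.nan],
})

# mean() skips NaN by default: day1 -> 22.0, day2 -> 31.0
column_means = small.mean()

# Passing a Series to fillna() matches its labels to the column names,
# so each column is filled with its own mean.
filled = small.fillna(column_means)

print(filled)
```

The gap in `day1` becomes 22.0 and the gap in `day2` becomes 31.0, confirming that the two columns are treated independently.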


## 4) Removing Duplicate Entries

**Duplicate** entries can **skew the analysis** and **inflate the statistics**.

There are **two methods** in **pandas** for checking and removing duplicate entries:
- **duplicated()**, which is used to **check for duplicate entries**
- **drop_duplicates()**, which is used to **delete duplicate entries**

In [47]:
df = pd.read_csv('data/ex1.csv')
df

Unnamed: 0,Name,AtBat,Hits,HmRun,Runs
0,Andy Allanson,293,66,1,30
1,Alan Ashby,315,81,7,24
2,Alvin Davis,479,130,18,66
3,Andy Allanson,293,66,1,30
4,Andre Dawson,496,141,20,65
5,Andres Galarraga,321,87,10,39
6,Alfredo Griffin,594,169,4,74
7,Alan Ashby,315,81,7,24
8,Al Newman,185,37,1,23
9,Alan Ashby,315,81,7,24


In [48]:
df.duplicated()

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7      True
8     False
9      True
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
dtype: bool

A value of **True** means that the row is a **duplicate** of an earlier row.

We can **chain** the method **any()** onto the previous code to return a **single boolean value** that tells us whether the **entire dataframe** contains any **duplicates**.

In [49]:
df.duplicated().any()

True

**drop_duplicates()**: To **delete** the **duplicate entries**

In [50]:
df.drop_duplicates()

Unnamed: 0,Name,AtBat,Hits,HmRun,Runs
0,Andy Allanson,293,66,1,30
1,Alan Ashby,315,81,7,24
2,Alvin Davis,479,130,18,66
4,Andre Dawson,496,141,20,65
5,Andres Galarraga,321,87,10,39
6,Alfredo Griffin,594,169,4,74
8,Al Newman,185,37,1,23
10,Argenis Salazar,298,73,0,24
11,Andres Thomas,323,81,6,26
12,Andre Thornton,401,92,17,49


In [51]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,Name,AtBat,Hits,HmRun,Runs
0,Andy Allanson,293,66,1,30
1,Alan Ashby,315,81,7,24
2,Alvin Davis,479,130,18,66
4,Andre Dawson,496,141,20,65
5,Andres Galarraga,321,87,10,39
6,Alfredo Griffin,594,169,4,74
8,Al Newman,185,37,1,23
10,Argenis Salazar,298,73,0,24
11,Andres Thomas,323,81,6,26
12,Andre Thornton,401,92,17,49
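
By default both methods compare entire rows and treat the first occurrence as the original. The sketch below shows the `subset` and `keep` arguments on a hypothetical frame named `players` (the names and numbers are invented for illustration and are not taken from `ex1.csv`):

```python
import pandas as pd

# Hypothetical data: player_a is repeated exactly, player_b with different hits.
players = pd.DataFrame({
    "Name": ["player_a", "player_b", "player_a", "player_b"],
    "Hits": [66, 81, 66, 90],
})

# Comparing all columns: only the exact repeat of player_a is flagged.
full_dupes = players.duplicated()

# Comparing only the Name column: both later rows are flagged.
name_dupes = players.duplicated(subset=["Name"])

# keep="last" retains the last occurrence of each name instead of the first.
last_kept = players.drop_duplicates(subset=["Name"], keep="last")

print(full_dupes.tolist())    # [False, False, True, False]
print(name_dupes.tolist())    # [False, False, True, True]
print(last_kept.index.tolist())  # [2, 3]
```

Whether a partial match such as `player_b` counts as a duplicate depends on the dataset, so the choice of `subset` is a judgment call for the analyst.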


## 5) Replacing Values

**replace()**: To **replace values** in **pandas**

In [73]:
s = pd.Series([23, 37, 999, 32, 32, 28, 999, 19, 24])
s

0     23
1     37
2    999
3     32
4     32
5     28
6    999
7     19
8     24
dtype: int64

In [74]:
s.replace(999, np.nan, inplace=True)
s

0    23.0
1    37.0
2     NaN
3    32.0
4    32.0
5    28.0
6     NaN
7    19.0
8    24.0
dtype: float64

In [75]:
s = pd.Series([23,37,999,32,32,28,1000,19,20,-999,24])
s

0       23
1       37
2      999
3       32
4       32
5       28
6     1000
7       19
8       20
9     -999
10      24
dtype: int64

In [76]:
s.replace([999, 1000, -999], np.nan, inplace=True)
s

0     23.0
1     37.0
2      NaN
3     32.0
4     32.0
5     28.0
6      NaN
7     19.0
8     20.0
9      NaN
10    24.0
dtype: float64

We can also use a **dictionary** inside the **replace()** function to **replace different values** with different substitutes.

In [78]:
df = pd.DataFrame(['male','femal','male','male','female','mal', 'female'], index=list('abcdefg'), columns=['gender'])
df

Unnamed: 0,gender
a,male
b,femal
c,male
d,male
e,female
f,mal
g,female


In [79]:
df.replace({'mal': 'male', 'femal':'female'})

Unnamed: 0,gender
a,male
b,female
c,male
d,male
e,female
f,male
g,female


## 6) Renaming Columns and Index Labels

**rename()**: To **rename** the **labels** in **dataframes**

In [82]:
df = pd.DataFrame(np.arange(12).reshape((4, 3)), index=['green', 'red', 'black', 'white'], columns=['one', 'two', 'three'])
df

Unnamed: 0,one,two,three
green,0,1,2
red,3,4,5
black,6,7,8
white,9,10,11


In [83]:
df.rename(index={'green':'yellow'}, inplace=True)
df

Unnamed: 0,one,two,three
yellow,0,1,2
red,3,4,5
black,6,7,8
white,9,10,11


In [84]:
df.rename(columns={'three':'four'}, inplace=True)
df

Unnamed: 0,one,two,four
yellow,0,1,2
red,3,4,5
black,6,7,8
white,9,10,11


We can also **change** the **format** of the **labels**. For example, here we convert the **index labels** to **capital letters** using the function **str.upper()**:

In [93]:
df.index = df.index.str.upper()
df

Unnamed: 0,one,two,four
YELLOW,0,1,2
RED,3,4,5
BLACK,6,7,8
WHITE,9,10,11


Similarly, we can **change** the **format** of the **column labels** to **title case** using **str.title()**:

In [94]:
df.columns = df.columns.str.title()
df

Unnamed: 0,One,Two,Four
YELLOW,0,1,2
RED,3,4,5
BLACK,6,7,8
WHITE,9,10,11


Other commonly used string methods:
- **str.count()**: To return the number of occurrences of a substring in the string.
- **str.join()**: To join the elements using the delimiter passed to the function.
- **str.strip()**: To trim leading and trailing whitespace.
- **str.lower()**: To convert alphabetic characters to lowercase.
- **str.upper()**: To convert alphabetic characters to uppercase.
- **str.title()**: To convert the first character of each word to uppercase and the remaining characters to lowercase.
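
These methods can be chained through the `.str` accessor to tidy labels in one step. A minimal sketch, assuming a hypothetical frame named `messy` whose column labels have inconsistent spacing and case:

```python
import pandas as pd

# Hypothetical frame with untidy column labels (illustration only).
messy = pd.DataFrame([[1, 2, 3]], columns=["  First Name ", "AGE", "home city"])

# Trim whitespace, lowercase, then replace inner spaces with underscores.
messy.columns = (
    messy.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(messy.columns))  # ['first_name', 'age', 'home_city']
```

The order matters: stripping first prevents the leading and trailing spaces from turning into underscores.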

## 7) Filtering Outliers

- An **outlier** is an **observation** that lies an **abnormal** distance from **other observations**.
- Deciding which value is an outlier is a **subjective decision** for the analyst.
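
Since the decision is subjective, analysts often start from a rule of thumb. One common convention (an assumption here, not a method prescribed by this notebook) flags values lying more than 1.5 times the interquartile range outside the quartiles. The Series `cards` below holds made-up values for illustration:

```python
import pandas as pd

# Hypothetical card counts with one suspicious value (illustration only).
cards = pd.Series([3, 4, 4, 2, 44, 4, 3, 5, 3, 4])

q1 = cards.quantile(0.25)   # first quartile
q3 = cards.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range

# Values outside [q1 - 1.5*iqr, q3 + 1.5*iqr] are flagged as candidates.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (cards < lower) | (cards > upper)

print(cards[is_outlier])
```

Here only the value 44 at index 4 falls outside the bounds. The rule only produces candidates; whether a flagged value is really an error still has to be judged in context.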

In [98]:
df = pd.read_csv('data/ex2.csv')
df.head(15)

Unnamed: 0,Income,Rating,Cards,Age
0,19.225,122,3,38
1,43.54,232,4,69
2,152.298,828,4,41
3,55.367,448,1,33
4,11.741,182,44,59
5,15.56,352,4,57
6,59.53,543,3,52
7,20.191,431,4,42
8,48.498,456,3,47
9,30.733,249,4,51


In [99]:
df.describe()

Unnamed: 0,Income,Rating,Cards,Age
count,20.0,20.0,20.0,20.0
mean,39.6811,352.7,7.6,51.05
std,33.482958,179.936568,14.485565,12.94716
min,11.741,120.0,1.0,26.0
25%,17.60175,238.75,2.0,41.75
50%,28.4,309.0,3.0,49.0
75%,50.21525,450.0,4.0,59.0
max,152.298,828.0,55.0,74.0


In [100]:
df[df.Cards > 5]

Unnamed: 0,Income,Rating,Cards,Age
4,11.741,182,44,59
12,14.084,120,55,46


In [101]:
df.loc[4, 'Cards'] = 4
df.loc[12, 'Cards'] = 5

In [102]:
df.Cards > 5

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
Name: Cards, dtype: bool

In [103]:
(df.Cards > 5).any()

False

In [104]:
df = pd.read_csv('data/ex2.csv')
df.head(15)

Unnamed: 0,Income,Rating,Cards,Age
0,19.225,122,3,38
1,43.54,232,4,69
2,152.298,828,4,41
3,55.367,448,1,33
4,11.741,182,44,59
5,15.56,352,4,57
6,59.53,543,3,52
7,20.191,431,4,42
8,48.498,456,3,47
9,30.733,249,4,51


In [105]:
df.loc[4, 'Cards'] = np.nan
df.loc[12, 'Cards'] = np.nan
df

Unnamed: 0,Income,Rating,Cards,Age
0,19.225,122,3.0,38
1,43.54,232,4.0,69
2,152.298,828,4.0,41
3,55.367,448,1.0,33
4,11.741,182,,59
5,15.56,352,4.0,57
6,59.53,543,3.0,52
7,20.191,431,4.0,42
8,48.498,456,3.0,47
9,30.733,249,4.0,51
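
Assigning `np.nan` row by row works when the outliers are known in advance. When they are defined by a condition, `mask()` can blank them all at once. The sketch below assumes a hypothetical frame named `credit` and reuses the threshold of 5 from the filter above:

```python
import pandas as pd

# Hypothetical frame with two implausible card counts (illustration only).
credit = pd.DataFrame({
    "Cards": [3, 4, 44, 1, 55],
    "Age": [38, 69, 59, 33, 46],
})

# mask() replaces values with NaN wherever the condition is True.
credit["Cards"] = credit["Cards"].mask(credit["Cards"] > 5)

print(credit)
```

The column becomes `float64` because `NaN` cannot be stored in an integer column, which is the same dtype change visible in the output above.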


## 8) Shuffling and Random Sampling

In [112]:
s = pd.Series(np.random.randint(20, size =10))
s

0    16
1     6
2     4
3    14
4    17
5    19
6    10
7    16
8     5
9    17
dtype: int64

**sample()**: To **randomly shuffle** the **values** in this **series**

In [113]:
s.sample(frac=1)

9    17
6    10
5    19
2     4
7    16
4    17
1     6
0    16
3    14
8     5
dtype: int64

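**frac=1** means that **100%** of the data will be **returned after shuffling**. Note that the **index labels were shuffled** along with the values.

The **index** can be **reset** to 0, 1, 2, ... **after the shuffling** using the function **reset_index()**. The argument **drop=True** discards the old index instead of keeping it as a new column.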

In [115]:
s.sample(frac=1).reset_index(drop=True)

0     6
1    14
2    16
3     4
4     5
5    17
6    17
7    10
8    16
9    19
dtype: int64
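
Each call to `sample()` produces a different order, which makes results hard to reproduce. Passing `random_state` fixes the seed. A minimal sketch on a hypothetical Series named `values`; the seed 42 is an arbitrary choice:

```python
import pandas as pd

# Hypothetical Series holding the numbers 0 to 9 (illustration only).
values = pd.Series(range(10))

# The same seed yields the same shuffle on every call.
first = values.sample(frac=1, random_state=42).reset_index(drop=True)
second = values.sample(frac=1, random_state=42).reset_index(drop=True)

print(first.equals(second))      # True
print(sorted(first.tolist()))    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Shuffling only reorders the data, so the sorted values are unchanged.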

In [116]:
df = pd.read_csv('data/ex3.csv')
df

Unnamed: 0,year,age,sex,maritl,race,education,wage
0,2006,18,1. Male,1. Never Married,1. White,1. < HS Grad,75.043154
1,2004,24,1. Male,1. Never Married,1. White,4. College Grad,70.476020
2,2003,45,1. Male,2. Married,1. White,3. Some College,130.982177
3,2003,43,1. Male,2. Married,3. Asian,4. College Grad,154.685293
4,2005,50,1. Male,4. Divorced,1. White,2. HS Grad,75.043154
...,...,...,...,...,...,...,...
2995,2008,44,1. Male,2. Married,1. White,3. Some College,154.685293
2996,2007,30,1. Male,2. Married,1. White,2. HS Grad,99.689464
2997,2005,27,1. Male,2. Married,2. Black,1. < HS Grad,66.229408
2998,2005,27,1. Male,1. Never Married,1. White,3. Some College,87.981033


In [117]:
sample = df.sample(frac=0.2).reset_index(drop=True)
sample

Unnamed: 0,year,age,sex,maritl,race,education,wage
0,2003,46,1. Male,2. Married,1. White,2. HS Grad,99.689464
1,2005,50,1. Male,2. Married,4. Other,1. < HS Grad,111.720849
2,2006,23,1. Male,2. Married,1. White,1. < HS Grad,40.405665
3,2003,41,1. Male,5. Separated,1. White,2. HS Grad,70.815039
4,2008,58,1. Male,2. Married,1. White,2. HS Grad,139.213788
...,...,...,...,...,...,...,...
595,2006,55,1. Male,2. Married,1. White,2. HS Grad,137.590143
596,2009,44,1. Male,2. Married,1. White,4. College Grad,141.775172
597,2009,51,1. Male,2. Married,1. White,2. HS Grad,118.019753
598,2008,27,1. Male,2. Married,1. White,3. Some College,63.188861


We can also **select** a **random sample** based on the **number of rows** rather than a fraction. Here we choose a subset of 100 rows:

In [118]:
sample = df.sample(n=100).reset_index(drop=True)
sample

Unnamed: 0,year,age,sex,maritl,race,education,wage
0,2003,33,1. Male,1. Never Married,1. White,4. College Grad,114.475713
1,2007,37,1. Male,1. Never Married,1. White,4. College Grad,114.475713
2,2009,47,1. Male,2. Married,2. Black,3. Some College,92.895845
3,2005,38,1. Male,2. Married,1. White,3. Some College,141.775172
4,2003,37,1. Male,2. Married,1. White,2. HS Grad,94.072715
...,...,...,...,...,...,...,...
95,2009,59,1. Male,3. Widowed,2. Black,2. HS Grad,104.921507
96,2004,31,1. Male,2. Married,2. Black,2. HS Grad,84.045958
97,2003,62,1. Male,3. Widowed,1. White,1. < HS Grad,62.030305
98,2004,36,1. Male,2. Married,1. White,4. College Grad,94.654020


## 9) Dummy Variables

- **Categorical variables** need to be converted into **dummy variables** before they can be used in **statistical modeling** or **machine learning models**.
- The **number** of **dummy variables** created equals the **number of distinct values** in the **categorical variable**.

In [119]:
df = pd.read_csv('data/ex4.csv')
df.head()

Unnamed: 0,year,age,sex,marital,race,education,wage
0,2006,18,Male,Never Married,White,< HS Grad,75.043154
1,2004,24,Male,Never Married,White,College Grad,70.47602
2,2003,45,Male,Married,Black,Some College,130.982177
3,2003,43,Female,Married,Asian,College Grad,154.685293
4,2005,50,Male,Divorced,White,HS Grad,75.043154


**pd.get_dummies()**: To **create dummy variables** from the **categorical variable**

In [120]:
marital = pd.get_dummies(df['marital'])
marital

Unnamed: 0,Divorced,Married,Never Married
0,0,0,1
1,0,0,1
2,0,1,0
3,0,1,0
4,1,0,0
5,0,1,0
6,0,1,0
7,0,0,1
8,0,0,1
9,0,1,0


In [122]:
pd.get_dummies(df[['marital', 'sex']])

Unnamed: 0,marital_ Divorced,marital_ Married,marital_ Never Married,sex_ Male,sex_Female
0,0,0,1,1,0
1,0,0,1,1,0
2,0,1,0,1,0
3,0,1,0,0,1
4,1,0,0,1,0
5,0,1,0,1,0
6,0,1,0,0,1
7,0,0,1,1,0
8,0,0,1,0,1
9,0,1,0,1,0


**join()**: To **add** the **created dummy variables** to the **dataframe**

In [123]:
new_df = df.join(marital)
new_df

Unnamed: 0,year,age,sex,marital,race,education,wage,Divorced,Married,Never Married
0,2006,18,Male,Never Married,White,< HS Grad,75.043154,0,0,1
1,2004,24,Male,Never Married,White,College Grad,70.47602,0,0,1
2,2003,45,Male,Married,Black,Some College,130.982177,0,1,0
3,2003,43,Female,Married,Asian,College Grad,154.685293,0,1,0
4,2005,50,Male,Divorced,White,HS Grad,75.043154,1,0,0
5,2008,54,Male,Married,White,College Grad,127.115744,0,1,0
6,2009,44,Female,Married,White,Some College,169.528538,0,1,0
7,2008,30,Male,Never Married,Asian,Some College,111.720849,0,0,1
8,2006,41,Female,Never Married,Black,Some College,118.884359,0,0,1
9,2004,52,Male,Married,White,HS Grad,128.680488,0,1,0
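
Recent pandas versions return boolean dummy columns by default, so reproducing the 0/1 values shown above may require `dtype=int`. The sketch below assumes a hypothetical frame named `people` and also adds a `prefix` so the new columns stay identifiable after joining:

```python
import pandas as pd

# Hypothetical frame with one categorical column (illustration only).
people = pd.DataFrame({
    "marital": ["Never Married", "Married", "Divorced", "Married"],
    "age": [18, 45, 50, 43],
})

# One 0/1 column per distinct category, each prefixed with "marital_".
dummies = pd.get_dummies(people["marital"], prefix="marital", dtype=int)

# join() aligns on the index and appends the dummy columns.
with_dummies = people.join(dummies)

print(list(dummies.columns))
print(with_dummies)
```

Three distinct categories produce three dummy columns, and every row has exactly one of them set to 1.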


## 10) String Object Methods

In [124]:
text1 = 'jone, sam, jake'

**split()**: To **split text** into **words** by using the comma as a **separator**

In [127]:
text1.split(',')

['jone', ' sam', ' jake']

**strip()**: To **remove the leading and trailing whitespace** left over after splitting a text

In [128]:
words = [x.strip() for x in text1.split(',')]
words

['jone', 'sam', 'jake']

In [129]:
text2 = 'Sam will go to the school today'
text2.split(' ')

['Sam', 'will', 'go', 'to', 'the', 'school', 'today']

In [130]:
text3 = ['sam', 'yahoo.com']
'@'.join(text3)

'sam@yahoo.com'

In [131]:
'school' in text2

True

In [132]:
text2.index('school')

19

In [133]:
text2.find('school')

19

In [134]:
text2.find('jone')

-1
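
`index()` and `find()` return the same position when the substring is present, but they differ when it is absent: `find()` returns -1, while `index()` raises a `ValueError`. A short sketch using the same sentence under the assumed name `sentence`:

```python
sentence = "Sam will go to the school today"

# find() signals "not found" with -1.
position = sentence.find("jone")

# index() signals "not found" by raising ValueError.
try:
    sentence.index("jone")
    raised = False
except ValueError:
    raised = True

print(position)  # -1
print(raised)    # True
```

`find()` is convenient for a quick check, whereas `index()` suits code where a missing substring should be treated as an error.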

In [135]:
text2.count('to')

2

In [136]:
text4 = 'sam:jake:jone'

In [137]:
text4.replace(':', ', ')

'sam, jake, jone'