<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>

# Data Formatting (strings)

Formatting is the step where we make sure every value has the right data type.

Keep in mind that while formatting you will often discover that more cleaning is needed.


# String case

In [None]:
from IPython.display import IFrame  
wikiLink2="https://en.wikipedia.org/wiki/List_of_active_rebel_groups" 
IFrame(wikiLink2, width=900, height=500)

In [None]:
import pandas as pd

rebels = pd.read_html(wikiLink2,flavor='bs4',
                        attrs = {'class': 'wikitable sortable'})
# how many did you get?
len(rebels)

So, this is our table:

In [None]:
theRebels=rebels[0].copy()
theRebels

In [None]:
# check names in columns

theRebels.columns

Let's simplify the column names by removing every non-word character:

In [None]:
theRebels.columns=theRebels.columns.str.replace(r'\W',"",regex=True)
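
To see what the `\W` pattern does, here is a minimal sketch on a toy data frame. The column names below are made up for illustration and are not necessarily the ones in the Wikipedia table.

```python
import pandas as pd

# Hypothetical column names, similar in spirit to scraped table headers
toy = pd.DataFrame(columns=['Rebel group', 'Within state', 'Ref(s).'])

# \W matches any character that is not a letter, a digit, or an underscore,
# so spaces, parentheses, and periods are all removed
toy.columns = toy.columns.str.replace(r'\W', '', regex=True)

cleaned = list(toy.columns)
print(cleaned)
```

Names without spaces or punctuation can be used with dot notation, as in `theRebels.Withinstate`.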

Keeping what you need:

In [None]:
theRebels.drop(columns=['References'],inplace=True)

Let's check the first column:

In [None]:
theRebels.iloc[:,0]

That gives me an idea: create a column `multinational` that flags the multinational rebel groups. There are two ways to find them:

In [None]:
theRebels[theRebels.Withinstate.str.startswith('Multinational')]

In [None]:
theRebels[theRebels.Withinstate.str.contains('Multinational',case=False)] # not case sensitive

In [None]:
theRebels['multinational']=theRebels.Withinstate.str.contains('Multinational',case=False)
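
The two approaches do not always agree. The sketch below uses made-up strings (assumptions, not values from the table) to show where `str.startswith` and `str.contains` differ:

```python
import pandas as pd

# Hypothetical cell values, chosen to show the difference
cells = pd.Series(['Multinational: A, B',
                   'Country A',
                   'multinational: C',
                   'D (multinational)'])

# startswith is case sensitive and only looks at the beginning of the string
starts = cells.str.startswith('Multinational')

# contains searches anywhere in the string; case=False ignores capitalization
has = cells.str.contains('Multinational', case=False)

print(pd.DataFrame({'cell': cells, 'startswith': starts, 'contains': has}))
```

`str.contains` is the more forgiving choice, but it also flags cells where the word appears in the middle of the text, so it is worth inspecting the matches before creating the column.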

The first column should keep only country names.
Remember that we had **country flags**. Take a look:

In [None]:
theRebels.Withinstate[0]

When you see Unicode characters, find out which of those symbols are present:

In [None]:
# extract the first non-ASCII sequence in each cell; drop missing values and keep unique strings
theRebels.Withinstate.str.extract(r'([^\x00-\x7F]+)')[0].dropna().unique()
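
As a rough illustration of that pattern, the sketch below runs it on a few made-up strings (assumed values, not taken from the table). Note that `str.extract` returns only the first match in each cell, and that legitimate accented letters are also non-ASCII:

```python
import pandas as pd

# Hypothetical strings: one with non-breaking spaces, one plain, one accented
cells = pd.Series(['Multinational:\xa0A\xa0B', 'Plain ASCII', 'Caf\u00e9'])

# [^\x00-\x7F]+ matches one or more characters outside the ASCII range;
# cells without a match become missing values and are dropped
found = cells.str.extract(r'([^\x00-\x7F]+)')[0].dropna().unique()

print(list(found))
```

Because accented letters show up too, check the list of matches before deciding which symbols to remove.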


Then, this code removes the leading `Multinational:` label:

In [None]:
theRebels.Withinstate=theRebels.Withinstate.str.replace(pat=r'Multinational:\s*\xa0', # optional whitespace before the non-breaking space
                                                        repl="",case=False,
                                                        regex=True)

We can now replace the remaining non-breaking spaces (`\xa0`) with commas:

In [None]:
theRebels.Withinstate[1]

In [None]:
theRebels.Withinstate.str.replace(r'\s*\xa0',",",regex=True) # preview: optional whitespace plus the non-breaking space

In [None]:
theRebels.Withinstate=theRebels.Withinstate.str.replace(r'\s*\xa0',",",regex=True)
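
The two replacements can be followed step by step on a toy Series. The strings below are assumptions that imitate the pattern described above (a label, then country names separated by non-breaking spaces); they are not copied from the table:

```python
import pandas as pd

# Hypothetical cells: \xa0 is the non-breaking space left behind by the flags
states = pd.Series(['Multinational:\xa0India \xa0Myanmar', 'Colombia'])

# Step 1: remove the label together with the non-breaking space that follows it
no_label = states.str.replace(pat=r'Multinational:\s*\xa0', repl='',
                              case=False, regex=True)

# Step 2: turn every remaining non-breaking space (and any whitespace
# right before it) into a comma
with_commas = no_label.str.replace(r'\s*\xa0', ',', regex=True)

print(with_commas.tolist())
```

The order matters: running step 2 first would turn the non-breaking space after the label into a comma, and the pattern in step 1 would no longer match.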

Choosing a capitalization is an important decision. These are the options:

* str.lower(): all to lowercase.

* str.upper(): ALL TO UPPERCASE.

* str.title(): First Character Of Each Word To Uppercase. 

* str.capitalize(): First character to uppercase and remaining to lowercase.

**These methods only work if the cells are strings.**

For example:

In [None]:
theRebels['state']=theRebels.Withinstate.str.lower()
theRebels['STATE']=theRebels.Withinstate.str.upper()
theRebels['StateName']=theRebels.Withinstate.str.title()

In [None]:
theRebels[['Withinstate','state','STATE','StateName']].head()
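
The four methods, and the strings-only restriction, can also be checked on toy data. This is a small sketch with made-up values; converting with `astype(str)` is one possible workaround for non-string columns, not the only one:

```python
import pandas as pd

names = pd.Series(['united STATES of america'])

lower = names.str.lower()[0]            # everything to lowercase
upper = names.str.upper()[0]            # everything to uppercase
title = names.str.title()[0]            # first character of each word to uppercase
capital = names.str.capitalize()[0]     # only the very first character to uppercase
print(lower, upper, title, capital, sep='\n')

# The .str accessor is not available for a numeric column
numbers = pd.Series([1, 2])
try:
    numbers.str.upper()
    failed = False
except AttributeError:
    failed = True

# Assumed workaround: convert the values to strings first
as_text = numbers.astype(str).str.upper()
print(failed, as_text.tolist())
```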

Let's keep only the uppercase version:

In [None]:
theRebels.drop(columns=['Withinstate','state','StateName'],inplace=True)

In [None]:
theRebels

In [None]:
# we can save this:
import os

# create the output folder if it does not exist yet
os.makedirs("data",exist_ok=True)
theRebels.to_csv(os.path.join("data","theRebelsCleaned.csv"),index=False)

Exercise:

Create a column that reports how many states each rebel group is present in.