In [1]:
import pandas as pd
import numpy as np
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08
make = df.make


# String Manipulation

The `make` column has an `object` type by default:

In [2]:
make

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

You can convert it to a string type with the `.astype` method:

In [3]:
make.astype('string')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string

In [4]:
# 'category' is useful for string columns with low cardinality: operations
# run once per category rather than once per value in the column.
make.astype('category')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']
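
One way to see the benefit of `'category'` for low-cardinality columns is to compare memory usage. This is a minimal sketch on a hypothetical three-make sample (not the vehicles data), so the exact byte counts will differ from the real column:

```python
import pandas as pd

# Hypothetical low-cardinality column: three makes repeated many times.
sample = pd.Series(['Subaru', 'Dodge', 'Ferrari'] * 10_000)

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = sample.astype(object).memory_usage(deep=True)
category_bytes = sample.astype('category').memory_usage(deep=True)

print(f'object:   {object_bytes:,} bytes')
print(f'category: {category_bytes:,} bytes')
```

The categorical version stores each distinct make once plus a small integer code per row, which is why it comes out smaller here.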

The `object`, `'string'`, and `'category'` types have a `.str` accessor that provides string manipulation:

In [5]:
'Ford'.lower()

'ford'

In [6]:
make.str.lower()

0        alfa romeo
1           ferrari
2             dodge
3             dodge
4            subaru
            ...    
41139        subaru
41140        subaru
41141        subaru
41142        subaru
41143        subaru
Name: make, Length: 41144, dtype: object

In [7]:
'Alfa Romeo'.find('a')

3
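
The `.str` accessor mirrors most of Python's string methods, so `.find` works on a whole series as well. A small sketch on a hypothetical four-value sample:

```python
import pandas as pd

sample = pd.Series(['Alfa Romeo', 'Ferrari', 'Dodge', 'Subaru'])

# Index of the first lowercase 'a' in each value; -1 means "not found",
# the same convention as Python's str.find.
positions = sample.str.find('a')
print(positions.tolist())  # [3, 4, -1, 3]
```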

### 11.4 Searching
To find the non-alphabetic characters (disregarding spaces), use `.str.extract` with a regular expression:

In [12]:
make.str.extract(r'([^a-z A-Z])')  # returns a DataFrame; rows without a match are missing

Unnamed: 0,0
0,
1,
2,
3,
4,
...,...
41139,
41140,
41141,
41142,


If we collapse the result into a series with `expand=False`, we can chain the `.value_counts` method to count the non-missing values:

In [11]:
(make
 .str.extract(r'([^a-z A-Z])', expand=False)
 .value_counts()
)


make
-    1727
.      46
,       9
Name: count, dtype: int64
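
Note that `.str.extract` only returns the first match in each string. The sketch below, on a hypothetical sample, contrasts that with `.str.findall`, which collects every match; the sample values are assumptions chosen to show the difference:

```python
import pandas as pd

# Hypothetical sample; the last entry has several non-alphabetic characters.
sample = pd.Series(['Alfa Romeo', 'Mercedes-Benz', 'Rolls-Royce',
                    'E. P. Dutton, Inc.'])

# First non-alphabetic, non-space character per value (missing if none).
first_match = sample.str.extract(r'([^a-z A-Z])', expand=False)
first_counts = first_match.value_counts()

# Every non-alphabetic, non-space character per value, one row per match.
all_counts = (sample
 .str.findall(r'[^a-z A-Z]')
 .explode()
 .value_counts()
)

print(first_counts)
print(all_counts)
```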

If a column in a CSV file should be numeric but contains non-numeric characters, use the following code to find them (`col` stands for the series being checked):

In [14]:
(col
 .str.extract(r'([^0-9.])', expand=False)
 .value_counts()
)

Series([], Name: count, dtype: int64)
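
To make the check concrete, here is a sketch on a small hypothetical series of numbers stored as strings. The character class `[^0-9.]` matches anything that is not a digit or a period:

```python
import pandas as pd

# Hypothetical raw values as they might arrive from a CSV file.
raw = pd.Series(['1.5', '20', '3,000', '$4', '7.25'])

# First character per value that is neither a digit nor a period.
offenders = raw.str.extract(r'([^0-9.])', expand=False).value_counts()
print(offenders)
```

Here the comma and the dollar sign each show up once, which tells you what to strip before calling `.astype(float)`.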

### 11.5 Splitting
Survey data often reports numeric values in bins.
The `.str.split` method is the first step toward pulling out the value before the dash and converting it to a number:

In [17]:
age = pd.Series(['0-10', '11-15', '11-15', '61-65', '46-50'])
age.str.split('-')

0     [0, 10]
1    [11, 15]
2    [11, 15]
3    [61, 65]
4    [46, 50]
dtype: object

A series of Python lists is hard to manipulate. To remedy that, pass `expand=True` to get a DataFrame back instead.

To use just the first column as the age value, chain an `.iloc` operation to pull out the first column, then convert the strings to integers with the `.astype` method:

In [18]:
(age
 .str.split('-', expand=True)
 .iloc[:,0]
 .astype(int)
 )

0     0
1    11
2    11
3    61
4    46
Name: 0, dtype: int64

This biases ages toward the low end of each bin. To use the upper end of the bin instead, use the `.str.slice` method or slice directly off of `.str`:

In [19]:
(age
 .str.slice(-2)
 .astype(int)
 )

0    10
1    15
2    15
3    65
4    50
dtype: int64

In [21]:
(age
 .str[-2:]
 .astype(int)
 )

0    10
1    15
2    15
3    65
4    50
dtype: int64
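
Slicing the last two characters works here because every upper bound has two digits. A sketch with hypothetical bins of mixed width shows where it breaks, and how taking the last column of the split avoids the problem:

```python
import pandas as pd

# Hypothetical bins whose upper bounds have different widths.
bins = pd.Series(['0-10', '95-100', '100-105'])

# Last two characters: wrong once the upper bound has three digits.
last_two = bins.str[-2:].astype(int)

# Last column of the split: independent of the number of digits.
upper = (bins
 .str.split('-', expand=True)
 .iloc[:, -1]
 .astype(int)
)

print(last_two.tolist())  # [10, 0, 5]
print(upper.tolist())     # [10, 100, 105]
```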

In [23]:
# Take the average of the two bounds:
(age
 .str.split('-', expand=True)
 .astype(int)
 .mean(axis='columns')
 )

0     5.0
1    13.0
2    13.0
3    63.0
4    48.0
dtype: float64

In [25]:
# Draw a random integer between the two bounds (inclusive):
import random

def between(row):
    return random.randint(*row.values)

(age
 .str.split('-', expand=True)
 .astype(int)
 .apply(between, axis='columns')
 )

0     5
1    11
2    12
3    65
4    47
dtype: int64

### 11.6 Optimizing `.apply` with Cython
To enable Cython in Jupyter, run:
`%load_ext Cython`
Then, as a first step, cythonize the `between` function:
```
%%cython
import random
def between_cy(row):
    return random.randint(*row.values)
```
This is no faster than the current code. The next step is to add types to the Cython code:
```
%%cython
import random
cpdef int between_cy3(int x, int y):
    return random.randint(x, y)
```
Because we call `.apply` across the columns axis, the function passed to it receives a whole row of data. Use a `lambda` to pull the row apart and call `between_cy3`:
```
(age
 .str.split('-', expand=True)
 .astype(int)
 .apply(lambda row: between_cy3(row[0], row[1]), axis=1)
 )
```
This still does not give much of a boost. The `%prun` magic shows where the computation time is being spent.
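
`%prun` is an IPython magic. Outside a notebook, the standard library's `cProfile` module produces a comparable report. This is a minimal sketch that profiles the plain-Python `between` version on the five-row `age` series, not the 500k-row dataset, so the timings themselves are not meaningful:

```python
import cProfile
import io
import pstats
import random

import pandas as pd

age = pd.Series(['0-10', '11-15', '11-15', '61-65', '46-50'])
bounds = age.str.split('-', expand=True).astype(int)

def between(row):
    return random.randint(*row.values)

# Profile only the .apply call.
profiler = cProfile.Profile()
profiler.enable()
result = bounds.apply(between, axis='columns')
profiler.disable()

# Show the ten entries with the highest cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats('cumulative')
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```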

After profiling, it becomes clear that passing NumPy arrays to the Cython function is the smarter approach:
```
%%cython
cimport numpy as np
import numpy as np
import random
cpdef np.ndarray[int] apply_between_cy4(np.ndarray[int] x, np.ndarray[int] y):
    cdef np.ndarray[int] res = np.empty(len(x), dtype='int32')
    for i in range(len(x)):
        res[i] = random.randint(x[i], y[i])
    return res

```
This version runs 8x faster on a dataset with 500k values:
```
(age
 .str.split('-', expand=True)
 .astype(int)
 .pipe(lambda df_: apply_between_cy4(df_.iloc[:, 0].to_numpy(dtype='int32'),
                                     df_.iloc[:, 1].to_numpy(dtype='int32')))
)
```
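
For comparison, the same idea of working on whole arrays can be sketched in plain NumPy, without Cython. This swaps in NumPy's `Generator.integers` for `random.randint`; it is an alternative technique and not the code timed above, and the seed is an arbitrary choice for repeatability:

```python
import numpy as np
import pandas as pd

age = pd.Series(['0-10', '11-15', '11-15', '61-65', '46-50'])
bounds = age.str.split('-', expand=True).astype(int)
low = bounds.iloc[:, 0].to_numpy()
high = bounds.iloc[:, 1].to_numpy()

rng = np.random.default_rng(seed=42)  # arbitrary seed, for repeatability

# endpoint=True makes the upper bound inclusive, like random.randint.
# low and high are arrays, so one draw is made per row.
draws = pd.Series(rng.integers(low, high, endpoint=True), index=age.index)
print(draws)
```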

In [30]:
%load_ext Cython

ModuleNotFoundError: No module named 'Cython'


### 11.7 Replacing Text
The `.replace` method exists on both the series and the `.str` accessor.
- To replace part of a string (such as a single character), use `.str.replace`. To replace complete values, use `.replace`.

In [34]:
make.str.replace('a', 'Å')

0        AlfÅ Romeo
1           FerrÅri
2             Dodge
3             Dodge
4            SubÅru
            ...    
41139        SubÅru
41140        SubÅru
41141        SubÅru
41142        SubÅru
41143        SubÅru
Name: make, Length: 41144, dtype: object

In [35]:
make.replace('A', 'Å')  # no make is exactly 'A', so nothing changes

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object
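
The difference is easier to see on a hypothetical sample that does contain a value equal to `'A'`:

```python
import pandas as pd

sample = pd.Series(['Alfa Romeo', 'Audi', 'A'])

# .str.replace works inside each string.
by_substring = sample.str.replace('A', 'Å')

# .replace only swaps values that match 'A' exactly.
by_value = sample.replace('A', 'Å')

print(by_substring.tolist())  # ['Ålfa Romeo', 'Åudi', 'Å']
print(by_value.tolist())      # ['Alfa Romeo', 'Audi', 'Å']
```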

- A dictionary can specify complete replacements. This is very explicit, but it becomes tedious at scale: to strip the dashes out of 20,000 numeric values, you would need a dictionary entry for every one of them.

In [37]:
make.replace({'Audi': 'Åudi', 'Acura': 'Åcura',
              'Aston Martin': 'Åston Martin',
              'Alfa Romeo': 'Ålfa Romeo'})

0        Ålfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object
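
For the dash-stripping case mentioned above, a substring replacement avoids building the dictionary. A sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical numeric values stored with dashes.
ids = pd.Series(['555-1234', '867-5309', '20-000'])

# regex=False treats '-' as a literal character.
no_dashes = ids.str.replace('-', '', regex=False)
as_numbers = no_dashes.astype(int)

print(as_numbers.tolist())  # [5551234, 8675309, 20000]
```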

Alternatively, pass `regex=True` to tell `.replace` that the pattern is a regular expression, which lets it replace a portion of each string:

In [38]:
make.replace('A', 'Å', regex=True)

0        Ålfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object
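
With `regex=True`, every match in each string is replaced, so anchors are useful when only part of the pattern space is wanted. A sketch on a hypothetical sample, comparing a bare `'A'` with the anchored pattern `^A`:

```python
import pandas as pd

# Hypothetical sample; the last value has a capital A that is not at the start.
sample = pd.Series(['Alfa Romeo', 'AM General', 'BMW Alpina'])

# Replace every capital A.
every_a = sample.replace('A', 'Å', regex=True)

# Replace a capital A only at the start of the string.
leading_a = sample.replace(r'^A', 'Å', regex=True)

print(every_a.tolist())    # ['Ålfa Romeo', 'ÅM General', 'BMW Ålpina']
print(leading_a.tolist())  # ['Ålfa Romeo', 'ÅM General', 'BMW Alpina']
```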