<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:150%; text-align:center; border-radius:10px 10px;">Session - 10</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Working with Text & Time Data</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [WORKING WITH TEXT DATA](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#00)
* [WORKING WITH TEXT DATA](#1)
    * [String Methods](#1.1)
    * [Most Useful String Methods](#1.2)
    * [Dummy Operations](#1.3)
* [WORKING WITH TIME DATA](#2)
    * [pd.to_datetime()](#2.1)
    * [Series.dt()](#2.2)
    * [Datetime Module](#2.3)
    * [Series.dt()](#2.4)
* [OPERATION WITH DATETIME OBJECT](#3)
* [THE END OF THE SESSION](#4)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="00"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with Text Data</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In this notebook, we will first discuss the string operations with our basic Series/Index and learn how to apply these string functions on the DataFrame.

Pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions skip missing/NaN values: a missing input stays missing in the output. Almost all of these methods mirror Python's built-in string methods [Refer To Official Python Documentation](https://docs.python.org/3/library/stdtypes.html#string-methods). To apply one to a Series, call it through the `.str` accessor, which runs the operation on every element.
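As a minimal sketch of the behavior described above (the sample names below are made up for illustration), a string method called through `.str` is applied to each element, and a missing value simply stays missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
names = pd.Series(["Tom BLUE", np.nan, "jason walker"])

# String methods are reached through the .str accessor and applied element-wise
upper_names = names.str.upper()

# The missing value is skipped and remains missing in the result
print(upper_names)
```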

In addition, according to [Pandas Official Document](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), there are two ways to store text data in pandas:
- object-dtype NumPy array.
- StringDtype extension type.

pandas recommends using StringDtype to store text data.
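The following sketch shows one way to opt in to StringDtype, assuming a small made-up Series. The dtype you get without an explicit conversion depends on the pandas version, so only the converted Series is relied on here:

```python
import pandas as pd

# Hypothetical sample data; the inferred dtype depends on the pandas version
raw_jobs = pd.Series(["manager", "recruiter", None])

# Explicit conversion to the StringDtype extension type
str_jobs = raw_jobs.astype("string")
print(str_jobs.dtype)

# Boolean string methods on StringDtype return the nullable "boolean" dtype,
# so the missing value becomes <NA> instead of forcing an object result
flags = str_jobs.str.startswith("man")
print(flags)
```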

[SOURCE01](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), [SOURCE02](https://www.w3schools.com/python/python_ref_string.asp)

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">String Methods</p>

<a id="1.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Strings implement all of the common sequence operations, along with the additional methods described at [the official documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).

Strings also support two styles of string formatting, one providing a large degree of flexibility and customization (**Please see the information about** [str.format()](https://docs.python.org/3/library/stdtypes.html#str.format), [Format String Syntax](https://docs.python.org/3/library/string.html#formatstrings) and [Custom String Formatting](https://docs.python.org/3/library/string.html#string-formatting)) and the other based on C printf style formatting that handles a narrower range of types and is slightly harder to use correctly, but is often faster for the cases it can handle ([printf-style String Formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)).

The [Text Processing Services](https://docs.python.org/3/library/text.html#textservices) section of the standard library covers a number of other modules that provide various text related utilities (including regular expression support in the [re](https://docs.python.org/3/library/re.html#module-re) module).

Please watch [**``Video Source``**](https://www.youtube.com/watch?v=6JNwK6hEneg) to deepen your understanding of working with text data in pandas.

**What are these String Methods? Let us examine some of the most common and useful String Methods and dig into them one by one:**

In [2]:
df = pd.read_excel("text_exercise.xlsx")
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          8 non-null      object
 1   staff       8 non-null      object
 2   department  8 non-null      object
 3   job         8 non-null      object
 4   salary      8 non-null      object
 5   age         8 non-null      object
dtypes: object(6)
memory usage: 512.0+ bytes


In [4]:
type(df["age"][0])

int

In [5]:
type(df["age"][4])

str

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Most Useful String Methods</p>

<a id="1.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- **str.lower() =>** Converts a string into lower case
- **str.upper() =>** Converts a string into upper case
- **str.capitalize() =>** Converts the first character to upper case and the rest to lower case
- **str.title() =>** Converts the first character of each word to upper case
- **str.swapcase() =>** Swaps the case lower/upper

[SOURCE01](https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm)
[SOURCE02](https://www.aboutdatablog.com/post/10-most-useful-string-functions-in-pandas)
[SOURCE03](https://towardsdatascience.com/5-must-know-pandas-operations-on-strings-4f88ca6b8e25)
[SOURCE04](https://towardsdatascience.com/pandas-string-operations-explained-fdfab7602fb4)
[SOURCE05](https://blog.devgenius.io/string-operations-on-pandas-dataframe-88af220439d1)
[SOURCE06](https://www.geeksforgeeks.org/string-manipulations-in-pandas-dataframe/)

___

In [6]:
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


In [7]:
df["staff"]

0         Tom BLUE
1       JOHN BLACK
2    Micheal Brown
3     jason walker
4       Alex Green
5      OSCAR SMİTH
6      Adrian STAR
7     Albert simon
Name: staff, dtype: object

In [8]:
# df["staff"].lower    ## this does not work because lower is a built-in Python string method; pandas string methods go through .str:

In [9]:
df["staff"].str.lower()

0         tom blue
1       john black
2    micheal brown
3     jason walker
4       alex green
5     oscar smi̇th
6      adrian star
7     albert simon
Name: staff, dtype: object

In [10]:
df["staff"].str.upper()

0         TOM BLUE
1       JOHN BLACK
2    MICHEAL BROWN
3     JASON WALKER
4       ALEX GREEN
5      OSCAR SMİTH
6      ADRIAN STAR
7     ALBERT SIMON
Name: staff, dtype: object

In [11]:
df["staff"].str.title()

0         Tom Blue
1       John Black
2    Micheal Brown
3     Jason Walker
4       Alex Green
5     Oscar Smi̇th
6      Adrian Star
7     Albert Simon
Name: staff, dtype: object

In [12]:
df["staff"].str.capitalize()        # capitalizes only the first character and lowercases the rest

0         Tom blue
1       John black
2    Micheal brown
3     Jason walker
4       Alex green
5     Oscar smi̇th
6      Adrian star
7     Albert simon
Name: staff, dtype: object

In [13]:
df["staff"].str.swapcase()   # turns upper case into lower case and lower case into upper case

0         tOM blue
1       john black
2    mICHEAL bROWN
3     JASON WALKER
4       aLEX gREEN
5     oscar smi̇th
6      aDRIAN star
7     aLBERT SIMON
Name: staff, dtype: object

In [14]:
df["staff"]

0         Tom BLUE
1       JOHN BLACK
2    Micheal Brown
3     jason walker
4       Alex Green
5      OSCAR SMİTH
6      Adrian STAR
7     Albert simon
Name: staff, dtype: object

___

- **str.isalpha()     =>** Returns True if all characters in the string are alphabetic
- **str.isnumeric()   =>** Returns True if all characters in the string are numeric
- **str.isalnum()     =>** Returns True if all characters in the string are alphanumeric
- **str.endswith()    =>** Returns True if the string ends with the specified value
- **str.startswith()  =>** Returns True if the string starts with the specified value
- **str.contains()    =>** Returns True for each element that contains the given substring or pattern, else False

[SOURCE01](https://careerkarma.com/blog/python-isalpha-isnumeric-isalnum/)
[SOURCE02](https://careerkarma.com/blog/python-startswith-and-endswith/)
[SOURCE03](https://www.geeksforgeeks.org/python-startswith-endswidth-function/)
[SOURCE04](https://towardsdatascience.com/check-for-a-substring-in-a-pandas-dataframe-column-4b949f64852#:~:text=The%20contains%20method%20in%20Pandas,str.)

___

In [15]:
# isalpha() checks which strings in the Series consist only of alphabetic characters and returns True for those.
# If a string contains even one non-alphabetic character, it returns False. For example, a space prevents a string
# from counting as alphabetic.

**isalpha()** checks whether each string consists of alphabetic characters only. It returns True when every character in the string is alphabetic, and False when the string is empty or contains any non-alphabetic character, such as a space or a digit.

In [16]:
df["job"]

0               manager
1               manager
2        data scientist
3             recruiter
4     backend developer
5    frontend developer
6        data scientist
7        data scientist
Name: job, dtype: object

In [17]:
df["job"].str.isalpha()

0     True
1     True
2    False
3     True
4    False
5    False
6    False
7    False
Name: job, dtype: bool

**isnumeric()** checks whether all characters in each string are numeric. This is equivalent to running the Python string method str.isnumeric() for each element of the Series/Index.

In [18]:
df["age"]

0    52
1    48
2    35
3    38
4     -
5    32
6    40
7    35
Name: age, dtype: object

In [19]:
df["age"].str.isnumeric()

0      NaN
1      NaN
2      NaN
3      NaN
4    False
5      NaN
6      NaN
7      NaN
Name: age, dtype: object

## The elements of the Series are int, yet we are trying to apply a str method to them.
## The column dtype is object, so the .str method runs without an error, but it returns NaN for the elements that are int.
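A small self-contained sketch of this situation, using a made-up object column rather than the notebook's DataFrame. The `pd.to_numeric(..., errors="coerce")` line is an alternative not used in this notebook, shown only for comparison:

```python
import pandas as pd

# Hypothetical object column mixing integers and a string placeholder
age = pd.Series([52, 48, "-", 32], dtype="object")

# .str methods return NaN for elements that are not strings
raw_check = age.str.isnumeric()

# Casting to str first turns every element into a string
clean_check = age.astype(str).str.isnumeric()

# Alternative: convert to numbers and turn unparseable entries into NaN
numeric_age = pd.to_numeric(age, errors="coerce")

print(raw_check, clean_check, numeric_age, sep="\n\n")
```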

In [20]:
type(df["age"][0])  

int

In [21]:
df["age"].str.isnumeric()  # for this to work correctly, the elements must first be converted to str:

0      NaN
1      NaN
2      NaN
3      NaN
4    False
5      NaN
6      NaN
7      NaN
Name: age, dtype: object

In [22]:
df["age"].astype("str").str.isnumeric() 

0     True
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: age, dtype: bool

**isalnum()** checks whether each string consists of alphanumeric characters only. It returns True when every character is a letter or a digit, and False when the string is empty or contains any other character, such as a space, a comma, or a symbol.

In [23]:
df["salary"]

0     "$150,000"
1     "$180,000"
2     "$150,000"
3    130000dolar
4     "$110,000"
5     "$120,000"
6     "$135,000"
7    125000dolar
Name: salary, dtype: object

In [24]:
df["salary"].str.isalnum()  # True if the string consists only of letters or digits, otherwise False

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
Name: salary, dtype: bool

Pandas **startswith()** tests whether the start of each string element matches a pattern. It is yet another way to search and filter text data in a Series or DataFrame. It is similar to Python's startswith() method, but it has different parameters and works on pandas objects only. Hence it must always be called through the .str accessor, which distinguishes it from the built-in string method.

In [25]:
df["job"]

0               manager
1               manager
2        data scientist
3             recruiter
4     backend developer
5    frontend developer
6        data scientist
7        data scientist
Name: job, dtype: object

In [26]:
df["job"].str.startswith("data")   # returns True for values that start with "data" and False for the others

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7     True
Name: job, dtype: bool

Pandas **endswith()** tests whether the end of each string element matches a specific sequence of characters.

In [27]:
df["job"].str.endswith("per")

0    False
1    False
2    False
3    False
4     True
5     True
6    False
7    False
Name: job, dtype: bool

The **contains()** method in pandas lets you search a column for a specific substring. By default the pattern is interpreted as a regular expression. It returns a boolean Series: True where the original value contains the pattern and False where it does not [SOURCE](https://towardsdatascience.com/check-for-a-substring-in-a-pandas-dataframe-column-4b949f64852#:~:text=The%20contains%20method%20in%20Pandas,str.).
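Because the pattern is treated as a regular expression by default, characters such as `$` have a special meaning. The sketch below uses made-up salary strings to show the `regex`, `case`, and `na` parameters of `str.contains`:

```python
import numpy as np
import pandas as pd

# Hypothetical salary strings, including one missing value
salary = pd.Series(['"$150,000"', "130000dolar", np.nan, "125000DOLAR"])

# As a regex, "$" means "end of string", so every string matches
as_regex = salary.str.contains("$")

# regex=False searches for the literal dollar sign instead
as_literal = salary.str.contains("$", regex=False)

# case=False ignores letter case; na=False counts missing values as non-matches
has_dolar = salary.str.contains("dolar", case=False, na=False)

print(as_regex, as_literal, has_dolar, sep="\n\n")
```

Passing `na=False` is handy when the result is used as a row filter, since a mask containing NaN cannot be used for boolean indexing.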

In [28]:
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


In [29]:
df["job"]

0               manager
1               manager
2        data scientist
3             recruiter
4     backend developer
5    frontend developer
6        data scientist
7        data scientist
Name: job, dtype: object

In [30]:
df["job"].str.contains("data")

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7     True
Name: job, dtype: bool

In [31]:
df["salary"]

0     "$150,000"
1     "$180,000"
2     "$150,000"
3    130000dolar
4     "$110,000"
5     "$120,000"
6     "$135,000"
7    125000dolar
Name: salary, dtype: object

In [32]:
df["salary"].str.contains("[a-z]")               # [a-z] matches any lowercase letter, so this flags values that contain one

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
Name: salary, dtype: bool

In [33]:
df["salary"].str.contains("dolar")

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
Name: salary, dtype: bool

We can use the string methods that return boolean values to build conditions and select the matching rows.

In [34]:
df["job"].str.contains("data")

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7     True
Name: job, dtype: bool

In [35]:
df.loc[df["job"].str.contains("data")]

Unnamed: 0,id,staff,department,job,salary,age
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


In [36]:
df.loc[df["job"].str.contains("data"),"department"] = "DS"
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,DS,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,DS,data scientist,"""$135,000""",40
7,E0006,Albert simon,DS,data scientist,125000dolar,35


In [37]:
df.loc[df["job"].str.contains("data"),"department"] = "IT"
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


___

- **str.strip() =>** Returns a trimmed version of the string

- **str.replace() =>** Returns a string where a specified value is replaced with another specified value

- **str.split() =>** Splits the string at the specified separator, and returns a list

- **str.find() =>** Searches the string for a specified value and returns the position where it was found

- **str.findall() =>** Returns a list of all occurrences of the pattern

- **str.join() =>** Joins the elements of an iterable into a single string

___

In [38]:
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


**NOTE:** To use and understand strip better, please review escape characters in Python [Source01 for Escape Characters](https://www.python-ds.com/python-3-escape-sequences) & [Source02 for Escape Characters](https://www.w3schools.com/python/gloss_python_escape_characters.asp)

In [39]:
df["salary"]

0     "$150,000"
1     "$180,000"
2     "$150,000"
3    130000dolar
4     "$110,000"
5     "$120,000"
6     "$135,000"
7    125000dolar
Name: salary, dtype: object

In [40]:
df["salary"].str.strip('"')

0       $150,000
1       $180,000
2       $150,000
3    130000dolar
4       $110,000
5       $120,000
6       $135,000
7    125000dolar
Name: salary, dtype: object

In [41]:
#or
df["salary"].str.strip("\"")

0       $150,000
1       $180,000
2       $150,000
3    130000dolar
4       $110,000
5       $120,000
6       $135,000
7    125000dolar
Name: salary, dtype: object

In [42]:
df["salary"].str.strip("\"").str.rstrip("dolar")

0    $150,000
1    $180,000
2    $150,000
3      130000
4    $110,000
5    $120,000
6    $135,000
7      125000
Name: salary, dtype: object

In [43]:
df["salary"].str.strip("\"").str.rstrip("dolar").str.lstrip("$")

0    150,000
1    180,000
2    150,000
3     130000
4    110,000
5    120,000
6    135,000
7     125000
Name: salary, dtype: object

## We could also have done it like this:

In [44]:
df["salary"].str.strip("\"dolar$")

0    150,000
1    180,000
2    150,000
3     130000
4    110,000
5    120,000
6    135,000
7     125000
Name: salary, dtype: object
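One caveat worth illustrating: the argument to `strip`, `lstrip`, and `rstrip` is a set of characters, not a substring. It works on the salary column because the digits are not in that set. The sketch below uses made-up values, and `str.removesuffix` assumes pandas 1.4 or newer:

```python
import pandas as pd

# Hypothetical values; "solar" is there only to expose the character-set behavior
values = pd.Series(["125000dolar", "solar"])

# rstrip removes any trailing characters found in the set {d, o, l, a, r}
stripped = values.str.rstrip("dolar")

# removesuffix removes the exact suffix "dolar" and nothing else
suffix_removed = values.str.removesuffix("dolar")

print(stripped.tolist())
print(suffix_removed.tolist())
```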

In [45]:
## The commas in the middle are still there; how do we remove them?

In [46]:
df["salary"].str.strip("\"dolar$").replace(",","")   

0    150,000
1    180,000
2    150,000
3     130000
4    110,000
5    120,000
6    135,000
7     125000
Name: salary, dtype: object

## Normally I can use replace like this, but I did not get the result I wanted. The reason:
### str.replace() searches each row for the given expression as a substring and replaces it, whereas replace looks at the WHOLE value of the row and acts only if it matches the given value exactly; otherwise it does nothing.

In [47]:
df["salary"].str.strip("\"dolar$").str.replace(",","")   
## str.replace did what I wanted

0    150000
1    180000
2    150000
3    130000
4    110000
5    120000
6    135000
7    125000
Name: salary, dtype: object

In [48]:
## There is still a data type problem; let's fix that too.

In [49]:
df["salary"].str.strip("\"dolar$").str.replace(",","").astype("int")   

0    150000
1    180000
2    150000
3    130000
4    110000
5    120000
6    135000
7    125000
Name: salary, dtype: int32

In [50]:
df["salary"] = df["salary"].str.strip("\"dolar$").str.replace(",","").astype("int")   

In [51]:
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,150000,52
1,M0002,JOHN BLACK,IT,manager,180000,48
2,E0001,Micheal Brown,IT,data scientist,150000,35
3,E0002,jason walker,HR,recruiter,130000,38
4,E0003,Alex Green,IT,backend developer,110000,-
5,E0004,OSCAR SMİTH,IT,frontend developer,120000,32
6,E0005,Adrian STAR,IT,data scientist,135000,40
7,E0006,Albert simon,IT,data scientist,125000,35


## SO IF WE WANT TO CHANGE A SUBSTRING, WE USE str.replace().

### ``str.replace()`` vs ``.replace()``

- **Purpose:** Use **str.replace** for substring replacements on a single string column, and **replace** for any general replacement on one or more columns.

- **Usage:** **str.replace** can replace one thing at a time. **replace** lets you perform multiple independent replacements, i.e., replace many things at once.

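- **Default behavior:** In pandas versions before 2.0, **str.replace** treats the pattern as a regular expression by default; from pandas 2.0 onward the default is regex=False, so pass regex=True explicitly when a regex is intended. **replace** only performs a full match unless the regex=True switch is used.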

### Another difference: because str.replace works on substrings, the new value must also be a string. replace, by contrast, matches the whole value, so the new value can be of any data type.
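The differences can be seen side by side in a short sketch. The Series below is made up for illustration, and `regex=False` is passed explicitly so that the result does not depend on the pandas version:

```python
import numpy as np
import pandas as pd

# Hypothetical values
salary = pd.Series(["150,000", "130000", "-"])

# replace compares whole values: no element equals ",", so nothing changes
whole_value = salary.replace(",", "")

# str.replace works on substrings inside each string
substring = salary.str.replace(",", "", regex=False)

# replace accepts several independent whole-value replacements at once,
# and the new value does not have to be a string
several = salary.replace({"-": np.nan, "130000": "130,000"})

print(whole_value.tolist())
print(substring.tolist())
print(several.tolist())
```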

In [52]:
df["age"]

0    52
1    48
2    35
3    38
4     -
5    32
6    40
7    35
Name: age, dtype: object

In [53]:
df["age"].replace(to_replace="-",value=np.nan).astype("float")

0    52.0
1    48.0
2    35.0
3    38.0
4     NaN
5    32.0
6    40.0
7    35.0
Name: age, dtype: float64

In [54]:
df["age"] = df["age"].replace(to_replace="-",value=np.nan).astype("float")

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          8 non-null      object 
 1   staff       8 non-null      object 
 2   department  8 non-null      object 
 3   job         8 non-null      object 
 4   salary      8 non-null      int32  
 5   age         7 non-null      float64
dtypes: float64(1), int32(1), object(4)
memory usage: 480.0+ bytes


**Indexing with .str[]** 

You can use [] notation to directly index by position locations [SOURCE](https://pandas.pydata.org/pandas-docs/version/0.15/text.html). 

## ***********With str[] we can go over each row and index into its value***********************
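A brief sketch of `.str[]` on made-up data: it accepts a single position or a slice, and after `str.split()` it indexes into the resulting lists. The `expand=True` variant is an alternative to indexing that returns the pieces as columns:

```python
import pandas as pd

# Hypothetical sample data
ids = pd.Series(["M0001", "E0002"])
staff = pd.Series(["Tom BLUE", "jason walker"])

first_char = ids.str[0]     # a single position in each string
number_part = ids.str[1:]   # a slice of each string

# After split, each element is a list, and .str[] indexes into that list
last_name = staff.str.split().str[-1]

# expand=True returns the pieces as DataFrame columns instead of lists
name_parts = staff.str.split(expand=True)

print(first_char.tolist(), number_part.tolist(), last_name.tolist())
print(name_parts)
```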

In [56]:
df["staff"]

0         Tom BLUE
1       JOHN BLACK
2    Micheal Brown
3     jason walker
4       Alex Green
5      OSCAR SMİTH
6      Adrian STAR
7     Albert simon
Name: staff, dtype: object

In [57]:
df["staff"].str.split()

0         [Tom, BLUE]
1       [JOHN, BLACK]
2    [Micheal, Brown]
3     [jason, walker]
4       [Alex, Green]
5      [OSCAR, SMİTH]
6      [Adrian, STAR]
7     [Albert, simon]
Name: staff, dtype: object

In [58]:
df["staff"].str.split().str[0].str.title()

0        Tom
1       John
2    Micheal
3      Jason
4       Alex
5      Oscar
6     Adrian
7     Albert
Name: staff, dtype: object

In [59]:
df["first_name"] = df["staff"].str.split().str[0].str.title()

In [60]:
df

Unnamed: 0,id,staff,department,job,salary,age,first_name
0,M0001,Tom BLUE,HR,manager,150000,52.0,Tom
1,M0002,JOHN BLACK,IT,manager,180000,48.0,John
2,E0001,Micheal Brown,IT,data scientist,150000,35.0,Micheal
3,E0002,jason walker,HR,recruiter,130000,38.0,Jason
4,E0003,Alex Green,IT,backend developer,110000,,Alex
5,E0004,OSCAR SMİTH,IT,frontend developer,120000,32.0,Oscar
6,E0005,Adrian STAR,IT,data scientist,135000,40.0,Adrian
7,E0006,Albert simon,IT,data scientist,125000,35.0,Albert


In [61]:
df["last_name"] = df["staff"].str.split().str[1].str.title()

In [62]:
df.drop("staff",axis=1,inplace=True)
df

Unnamed: 0,id,department,job,salary,age,first_name,last_name
0,M0001,HR,manager,150000,52.0,Tom,Blue
1,M0002,IT,manager,180000,48.0,John,Black
2,E0001,IT,data scientist,150000,35.0,Micheal,Brown
3,E0002,HR,recruiter,130000,38.0,Jason,Walker
4,E0003,IT,backend developer,110000,,Alex,Green
5,E0004,IT,frontend developer,120000,32.0,Oscar,Smi̇th
6,E0005,IT,data scientist,135000,40.0,Adrian,Star
7,E0006,IT,data scientist,125000,35.0,Albert,Simon


**str.find** returns, for each string in the Series/Index, the lowest index at which the substring is found. The substring must be fully contained within [start:end]. It returns -1 on failure. Equivalent to standard str.find().

**str.rfind** returns, for each string in the Series/Index, the highest index at which the substring is found. The substring must be fully contained within [start:end]. It returns -1 on failure. Equivalent to standard str.rfind().

**str.findall** finds all occurrences of a pattern or regular expression in each string of the Series/Index [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.findall.html).
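To make the difference between the three methods concrete, here is a minimal sketch on a made-up Series of job titles. `str.count` is not covered in this section; it is included only as a shortcut for counting matches:

```python
import pandas as pd

# Hypothetical job titles
job = pd.Series(["data scientist", "backend developer", "manager"])

first_e = job.str.find("e")    # lowest index of "e" in each string
last_e = job.str.rfind("e")    # highest index of "e" in each string

# findall takes a regular expression and returns every match as a list
vowels = job.str.findall("[aeiou]")

# count returns the number of matches directly
d_count = job.str.count("d")

print(first_e.tolist(), last_e.tolist(), d_count.tolist())
print(vowels.tolist())
```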

In [63]:
df["job"]

0               manager
1               manager
2        data scientist
3             recruiter
4     backend developer
5    frontend developer
6        data scientist
7        data scientist
Name: job, dtype: object

In [64]:
df["job"].str.find("developer")

0   -1
1   -1
2   -1
3   -1
4    8
5    9
6   -1
7   -1
Name: job, dtype: int64

## A result of -1 means "developer" does not occur in that row; any other number is the index of the first character of "developer".

In [65]:
df["job"].str.findall("developer")

0             []
1             []
2             []
3             []
4    [developer]
5    [developer]
6             []
7             []
Name: job, dtype: object

In [66]:
df["job"].str.findall("d")

0        []
1        []
2       [d]
3        []
4    [d, d]
5    [d, d]
6       [d]
7       [d]
Name: job, dtype: object

In [67]:
df["job"].str.findall("d").apply(len)

0    0
1    0
2    1
3    0
4    2
5    2
6    1
7    1
Name: job, dtype: int64

In [68]:
df

Unnamed: 0,id,department,job,salary,age,first_name,last_name
0,M0001,HR,manager,150000,52.0,Tom,Blue
1,M0002,IT,manager,180000,48.0,John,Black
2,E0001,IT,data scientist,150000,35.0,Micheal,Brown
3,E0002,HR,recruiter,130000,38.0,Jason,Walker
4,E0003,IT,backend developer,110000,,Alex,Green
5,E0004,IT,frontend developer,120000,32.0,Oscar,Smi̇th
6,E0005,IT,data scientist,135000,40.0,Adrian,Star
7,E0006,IT,data scientist,125000,35.0,Albert,Simon


In [69]:
df["skills"] = [[],["Java","C++"],["Python","Tableau","SQL"],[],["React","Django"],["JavaScript","Python"],["R","SQL"],["SQL","Python"]]
df["Skills"] = [[],[],["Python","Tableau","SQL"],[],["React","Django"],["JavaScript","Python"],["R","SQL"],["SQL","Python"]]
df.loc[1, "Skills"] = "Java,C++"
df

Unnamed: 0,id,department,job,salary,age,first_name,last_name,skills,Skills
0,M0001,HR,manager,150000,52.0,Tom,Blue,[],[]
1,M0002,IT,manager,180000,48.0,John,Black,"[Java, C++]","Java,C++"
2,E0001,IT,data scientist,150000,35.0,Micheal,Brown,"[Python, Tableau, SQL]","[Python, Tableau, SQL]"
3,E0002,HR,recruiter,130000,38.0,Jason,Walker,[],[]
4,E0003,IT,backend developer,110000,,Alex,Green,"[React, Django]","[React, Django]"
5,E0004,IT,frontend developer,120000,32.0,Oscar,Smi̇th,"[JavaScript, Python]","[JavaScript, Python]"
6,E0005,IT,data scientist,135000,40.0,Adrian,Star,"[R, SQL]","[R, SQL]"
7,E0006,IT,data scientist,125000,35.0,Albert,Simon,"[SQL, Python]","[SQL, Python]"


If the elements of a Series are lists themselves, join the content of these lists using the delimiter passed to the function. This function is an equivalent to str.join() [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.join.html).

**Join** lists contained as elements in the Series/Index with passed delimiter.

In [70]:
df["skills"].str.join("-")   
# str.join merges the elements of each list into a single string, placing the given string between them

0                      
1              Java-C++
2    Python-Tableau-SQL
3                      
4          React-Django
5     JavaScript-Python
6                 R-SQL
7            SQL-Python
Name: skills, dtype: object

In [71]:
df["skills"].str.join(",")

0                      
1              Java,C++
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: skills, dtype: object

## WHAT WOULD HAPPEN IF WE DID THE SAME ON THE OTHER COLUMN?

In [72]:
df["Skills"].str.join(",")

0                      
1       J,a,v,a,,,C,+,+
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: Skills, dtype: object

## IT ALSO SPLIT THE ITERABLE VALUE THAT WAS NOT A LIST (THE STRING) INTO ITS CHARACTERS AND PUT COMMAS BETWEEN THEM, WHICH IS NOT WHAT WE WANT.

## SO WHAT DO WE DO WHEN SOME VALUES IN THE COLUMN ARE LISTS AND OTHERS ARE NOT? WE NEED CODE THAT HANDLES BOTH CASES:

In [73]:
df["Skills"].apply(lambda x: ",".join(x) if isinstance(x, list) else x)

0                      
1              Java,C++
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: Skills, dtype: object

## We check, for each row, whether its value is a list or not.
## NOTE: we did not use str.join here; we used the built-in join, because the lambda receives a single element, not a Series.

# We could also write it like this:

In [74]:
[",".join(x) if isinstance(x, list) else x for x in df["Skills"]]

['',
 'Java,C++',
 'Python,Tableau,SQL',
 '',
 'React,Django',
 'JavaScript,Python',
 'R,SQL',
 'SQL,Python']

In [75]:
## apply already runs this loop internally

In [76]:
df["Skills"] = [",".join(x) if isinstance(x, list) else x for x in df["Skills"]]
df

Unnamed: 0,id,department,job,salary,age,first_name,last_name,skills,Skills
0,M0001,HR,manager,150000,52.0,Tom,Blue,[],
1,M0002,IT,manager,180000,48.0,John,Black,"[Java, C++]","Java,C++"
2,E0001,IT,data scientist,150000,35.0,Micheal,Brown,"[Python, Tableau, SQL]","Python,Tableau,SQL"
3,E0002,HR,recruiter,130000,38.0,Jason,Walker,[],
4,E0003,IT,backend developer,110000,,Alex,Green,"[React, Django]","React,Django"
5,E0004,IT,frontend developer,120000,32.0,Oscar,Smi̇th,"[JavaScript, Python]","JavaScript,Python"
6,E0005,IT,data scientist,135000,40.0,Adrian,Star,"[R, SQL]","R,SQL"
7,E0006,IT,data scientist,125000,35.0,Albert,Simon,"[SQL, Python]","SQL,Python"


In [77]:
## We turned the lists into strings so that we can create dummy variables:

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Dummy Operations</p>

<a id="1.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

A dataset may contain various types of values, and some columns hold categorical values. To use those categorical values in numerical computations, we create dummy variables. A dummy variable is a binary variable that indicates whether a categorical variable takes on a specific value [SOURCE](https://www.geeksforgeeks.org/how-to-create-dummy-variables-in-python-with-pandas/).

### get_dummies()

**Syntax1:** ``pd.get_dummies(data, prefix=None, prefix_sep="_")``<br>
            **OR**<br>
**Syntax2:** ``df["col_name"].str.get_dummies(sep=",")``

**Parameters:**
- data= input data: array-like, Series, or DataFrame.
- prefix= string (or list/dict of strings) prepended to the dummy column names.
- prefix_sep= separator placed between the prefix and the category value (default "_").
- Return Type: DataFrame of dummy variables.
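The parameters above can be sketched on a tiny made-up Series (the data and the `dept` prefix are illustrative assumptions). `dtype=int` is passed so the result shows 0/1; recent pandas versions return boolean columns by default.

```python
import pandas as pd

dept = pd.Series(["HR", "IT", "IT"], name="department")  # made-up data

# prefix is prepended to each category name, prefix_sep goes between them
dummies = pd.get_dummies(dept, prefix="dept", prefix_sep="-", dtype=int)
print(dummies.columns.tolist())
print(dummies)
```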

In [78]:
df

Unnamed: 0,id,department,job,salary,age,first_name,last_name,skills,Skills
0,M0001,HR,manager,150000,52.0,Tom,Blue,[],
1,M0002,IT,manager,180000,48.0,John,Black,"[Java, C++]","Java,C++"
2,E0001,IT,data scientist,150000,35.0,Micheal,Brown,"[Python, Tableau, SQL]","Python,Tableau,SQL"
3,E0002,HR,recruiter,130000,38.0,Jason,Walker,[],
4,E0003,IT,backend developer,110000,,Alex,Green,"[React, Django]","React,Django"
5,E0004,IT,frontend developer,120000,32.0,Oscar,Smi̇th,"[JavaScript, Python]","JavaScript,Python"
6,E0005,IT,data scientist,135000,40.0,Adrian,Star,"[R, SQL]","R,SQL"
7,E0006,IT,data scientist,125000,35.0,Albert,Simon,"[SQL, Python]","SQL,Python"


In [79]:
df["department"]

0    HR
1    IT
2    IT
3    HR
4    IT
5    IT
6    IT
7    IT
Name: department, dtype: object

In [80]:
## These are categorical values with no inherent order among them, so we will use pandas get_dummies.

In [81]:
pd.get_dummies(data=df["department"])

Unnamed: 0,HR,IT
0,1,0
1,0,1
2,0,1
3,1,0
4,0,1
5,0,1
6,0,1
7,0,1


As you can see, two dummy variables are created for the two categories ("HR" and "IT") of the "department" column. We can create dummy variables in pandas with the **``get_dummies()``** function.

Passing **``drop_first=True``** drops the first dummy column. With k categories, the k dummy columns always sum to 1 in every row, so one of them is redundant. Dropping it leaves k-1 columns and removes that perfect correlation among the dummy variables without losing information [SOURCE](https://stackoverflow.com/questions/63661560/drop-first-true-during-dummy-variable-creation-in-pandas#:~:text=1%20Answer,correlations%20created%20among%20dummy%20variables.).

In [82]:
pd.get_dummies(data=df["department"],drop_first=True)

Unnamed: 0,IT
0,0
1,1
2,1
3,0
4,1
5,1
6,1
7,1


## drop_first=True means "drop the first column". There were only two columns, and one of them is redundant when fed to a model, so we drop the first: rows with 1 are IT and rows with 0 are the other category, so no information is lost. The model has no use for the names themselves. The same works with three columns: a row where both remaining columns are 0 belongs to the dropped category.
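A minimal sketch of the three-category case, using made-up department names. It shows that the full encoding always sums to 1 per row and that the dropped category appears as an all-zero row.

```python
import pandas as pd

dept = pd.Series(["HR", "IT", "Sales", "IT"])  # made-up data

full = pd.get_dummies(dept, dtype=int)
reduced = pd.get_dummies(dept, drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so one column is redundant
print(full.sum(axis=1).tolist())

# "HR" sorts first, so it is the dropped category: its row is all zeros
print(reduced)
```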

## Note

In [83]:
df["Skills"]

0                      
1              Java,C++
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: Skills, dtype: object

In [84]:
df["Skills"].str.get_dummies(sep = ",")

Unnamed: 0,C++,Django,Java,JavaScript,Python,R,React,SQL,Tableau
0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,1
3,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,1,0,0
5,0,0,0,1,1,0,0,0,0
6,0,0,0,0,0,1,0,1,0
7,0,0,0,0,1,0,0,1,0


In [85]:
## Note that we passed sep here.

## We may need to apply get_dummies to more than one column. To see which source column each dummy column came from, we add a prefix:

In [86]:
df["Skills"].str.get_dummies(sep = ",").add_prefix("Skills_")

Unnamed: 0,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau
0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,1
3,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,1,0,0
5,0,0,0,1,1,0,0,0,0
6,0,0,0,0,0,1,0,1,0
7,0,0,0,0,1,0,0,1,0


In [87]:
Skill_dummy = df["Skills"].str.get_dummies(sep = ",").add_prefix("Skills_")

In [88]:
df

Unnamed: 0,id,department,job,salary,age,first_name,last_name,skills,Skills
0,M0001,HR,manager,150000,52.0,Tom,Blue,[],
1,M0002,IT,manager,180000,48.0,John,Black,"[Java, C++]","Java,C++"
2,E0001,IT,data scientist,150000,35.0,Micheal,Brown,"[Python, Tableau, SQL]","Python,Tableau,SQL"
3,E0002,HR,recruiter,130000,38.0,Jason,Walker,[],
4,E0003,IT,backend developer,110000,,Alex,Green,"[React, Django]","React,Django"
5,E0004,IT,frontend developer,120000,32.0,Oscar,Smi̇th,"[JavaScript, Python]","JavaScript,Python"
6,E0005,IT,data scientist,135000,40.0,Adrian,Star,"[R, SQL]","R,SQL"
7,E0006,IT,data scientist,125000,35.0,Albert,Simon,"[SQL, Python]","SQL,Python"


In [96]:
df_final = df[["department","job","salary","Skills"]]

In [97]:
## Columns where every value is unique (such as id numbers) and name columns are not fed into a model!

In [98]:
df_final

Unnamed: 0,department,job,salary,Skills
0,HR,manager,150000,
1,IT,manager,180000,"Java,C++"
2,IT,data scientist,150000,"Python,Tableau,SQL"
3,HR,recruiter,130000,
4,IT,backend developer,110000,"React,Django"
5,IT,frontend developer,120000,"JavaScript,Python"
6,IT,data scientist,135000,"R,SQL"
7,IT,data scientist,125000,"SQL,Python"


In [99]:
df_final.join(Skill_dummy)

Unnamed: 0,department,job,salary,Skills,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau
0,HR,manager,150000,,0,0,0,0,0,0,0,0,0
1,IT,manager,180000,"Java,C++",1,0,1,0,0,0,0,0,0
2,IT,data scientist,150000,"Python,Tableau,SQL",0,0,0,0,1,0,0,1,1
3,HR,recruiter,130000,,0,0,0,0,0,0,0,0,0
4,IT,backend developer,110000,"React,Django",0,1,0,0,0,0,1,0,0
5,IT,frontend developer,120000,"JavaScript,Python",0,0,0,1,1,0,0,0,0
6,IT,data scientist,135000,"R,SQL",0,0,0,0,0,1,0,1,0
7,IT,data scientist,125000,"SQL,Python",0,0,0,0,1,0,0,1,0


In [100]:
df_final = df_final.join(Skill_dummy)
df_final.drop("Skills",axis=1,inplace=True)   # already encoded as dummies, so we drop it

In [102]:
df_final         ## two categorical columns still remain.

Unnamed: 0,department,job,salary,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau
0,HR,manager,150000,0,0,0,0,0,0,0,0,0
1,IT,manager,180000,1,0,1,0,0,0,0,0,0
2,IT,data scientist,150000,0,0,0,0,1,0,0,1,1
3,HR,recruiter,130000,0,0,0,0,0,0,0,0,0
4,IT,backend developer,110000,0,1,0,0,0,0,1,0,0
5,IT,frontend developer,120000,0,0,0,1,1,0,0,0,0
6,IT,data scientist,135000,0,0,0,0,0,1,0,1,0
7,IT,data scientist,125000,0,0,0,0,1,0,0,1,0


In [106]:
pd.get_dummies(df_final)  ## a whole DataFrame can be passed as well.
## called this way, it encodes the columns it detects as categorical and adds the prefix automatically;
## numeric columns are passed through unchanged.

Unnamed: 0,salary,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau,department_HR,department_IT,job_backend developer,job_data scientist,job_frontend developer,job_manager,job_recruiter
0,150000,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
1,180000,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0
2,150000,0,0,0,0,1,0,0,1,1,0,1,0,1,0,0,0
3,130000,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
4,110000,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0
5,120000,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0
6,135000,0,0,0,0,0,1,0,1,0,0,1,0,1,0,0,0
7,125000,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0
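If only some of the categorical columns should be encoded, `pd.get_dummies` also accepts a `columns` argument. The sketch below uses a made-up two-row frame, not the notebook's data.

```python
import pandas as pd

# Made-up frame with one numeric and two categorical columns
frame = pd.DataFrame({
    "department": ["HR", "IT"],
    "job": ["manager", "recruiter"],
    "salary": [150000, 130000],
})

# Encode only the listed column; the others are passed through unchanged
encoded = pd.get_dummies(frame, columns=["department"], dtype=int)
print(encoded.columns.tolist())
```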


In [108]:
pd.get_dummies(df_final,drop_first=True)   ## more compact; the model loses no information.

Unnamed: 0,salary,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau,department_IT,job_data scientist,job_frontend developer,job_manager,job_recruiter
0,150000,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,180000,1,0,1,0,0,0,0,0,0,1,0,0,1,0
2,150000,0,0,0,0,1,0,0,1,1,1,1,0,0,0
3,130000,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,110000,0,1,0,0,0,0,1,0,0,1,0,0,0,0
5,120000,0,0,0,1,1,0,0,0,0,1,0,1,0,0
6,135000,0,0,0,0,0,1,0,1,0,1,1,0,0,0
7,125000,0,0,0,0,1,0,0,1,0,1,1,0,0,0


In [110]:
df_final = pd.get_dummies(df_final,drop_first=True) 
df_final

Unnamed: 0,salary,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau,department_IT,job_data scientist,job_frontend developer,job_manager,job_recruiter
0,150000,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,180000,1,0,1,0,0,0,0,0,0,1,0,0,1,0
2,150000,0,0,0,0,1,0,0,1,1,1,1,0,0,0
3,130000,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,110000,0,1,0,0,0,0,1,0,0,1,0,0,0,0
5,120000,0,0,0,1,1,0,0,0,0,1,0,1,0,0
6,135000,0,0,0,0,0,1,0,1,0,1,1,0,0,0
7,125000,0,0,0,0,1,0,0,1,0,1,1,0,0,0


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with Time Data</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

For anyone who works with time series data on an almost daily basis, the pandas Python package is extremely useful for time series manipulation and analysis. This basic introduction to time series data manipulation with pandas should allow you to get started in your time series analysis. Specific objectives are to show you how to:
- create a date range
- work with timestamp data
- convert string data to a timestamp
- index and slice your time series data in a data frame
- resample your time series for different time period aggregates/summary statistics
- compute a rolling statistic such as a rolling average
- work with missing data
- understand the basics of unix/epoch time
- understand common pitfalls of time series data analysis [SOURCE](https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea)

In this section, we will introduce how to work with each of these types of date/time data in Pandas. This short section is by no means a complete guide to the time series tools available in Python or Pandas, but instead is intended as a broad overview of how you as a user should approach working with time series [SOURCE](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html).

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">pd.to_datetime()</p>

<a id="2.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

For more detailed information about the to_datetime() method, please [visit the official documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html).

**``pd.to_datetime()``** Converts argument to datetime.

This function converts a **``scalar``**, **``array-like``**, **``Series``** or **``DataFrame/dict-like``** to a pandas datetime object.

**As stated above, many input types are supported, and lead to different output types:**

- **``scalars``** can be int, float, str, datetime object (from stdlib datetime module or numpy). They are converted to Timestamp when possible, otherwise they are converted to datetime.datetime. None/NaN/null scalars are converted to NaT.

- **``array-like``** can contain int, float, str, datetime objects. They are converted to DatetimeIndex when possible, otherwise they are converted to Index with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

- **``Series``** are converted to Series with datetime64 dtype when possible, otherwise they are converted to Series with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

- **``DataFrame/dict-like``** are converted to Series with datetime64 dtype. For each row a datetime is created from assembling the various dataframe columns. Column keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
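The input types listed above can be sketched as follows. All values are made up for illustration.

```python
import pandas as pd

# Scalar string -> Timestamp
ts = pd.to_datetime("2021-01-23")

# List (array-like) -> DatetimeIndex; None becomes NaT
idx = pd.to_datetime(["2021-01-23", None])

# DataFrame with year/month/day columns -> Series of datetimes
parts = pd.DataFrame({"year": [2020, 2021], "month": [4, 1], "day": [2, 26]})
assembled = pd.to_datetime(parts)

print(type(ts).__name__, type(idx).__name__)
print(assembled.tolist())
```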

[Special Note :](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html)

As many data sets contain datetime information in one of the columns, pandas input functions like [pandas.read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) and [pandas.read_json()](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html#pandas.read_json) can convert to dates while reading the data, using the **``parse_dates``** parameter with a list of the columns to read as Timestamp.
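A minimal sketch of `parse_dates`, assuming a made-up CSV held in memory with `io.StringIO` so that the example does not depend on a file on disk:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file
csv_text = "id_product,order_date\n401,2021-01-23\n416,2020-04-02\n"

# order_date is parsed into a datetime column while reading
orders = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])
print(orders.dtypes)
```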

Why are these **``pandas.Timestamp``** objects useful? Let's illustrate the added value with some example cases. Let us assume that we want to work with the dates in the order_date and entry_date columns as datetime objects instead of plain text:

In [112]:
df = pd.read_csv("time_exercise.csv")
df

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date
0,401,2021-01-23,1.0,541.487603,2018-12-04
1,416,2020-04-02,1.0,131.181818,2018-12-04
2,717,2019-03-10,1.0,2035.488500,2018-12-04
3,778,2019-12-27,1.0,335.988000,2018-12-04
4,826,2020-02-19,1.0,342.292302,2018-12-04
...,...,...,...,...,...
906,1536842,2020-11-24,1.0,1186.776860,2020-10-07
907,1536842,2020-11-24,1.0,1186.776860,2020-10-07
908,1536887,2020-11-22,1.0,0.000000,2020-11-13
909,1536952,2021-01-26,1.0,988.429752,2020-11-24


In [113]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id_product        911 non-null    int64  
 1   order_date        911 non-null    object 
 2   product_quantity  911 non-null    float64
 3   product_price     911 non-null    float64
 4   entry_date        911 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 35.7+ KB


![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In [114]:
## To work with these columns, we must convert them to datetime.

Initially, the values in order_date and entry_date are character strings and do **NOT** support any datetime operations (e.g. extracting the year or the day of the week). When we apply the to_datetime function, pandas interprets the strings and converts them to datetime (i.e. ``datetime64[ns]``) objects. In pandas, these datetime objects, which are similar to datetime.datetime from the standard library, are called [pandas.Timestamp](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp).

In [115]:
df["order_date"]

0      2021-01-23
1      2020-04-02
2      2019-03-10
3      2019-12-27
4      2020-02-19
          ...    
906    2020-11-24
907    2020-11-24
908    2020-11-22
909    2021-01-26
910    2020-12-06
Name: order_date, Length: 911, dtype: object

In [117]:
pd.to_datetime(df["order_date"])   ## convert the column's type to datetime.

0     2021-01-23
1     2020-04-02
2     2019-03-10
3     2019-12-27
4     2020-02-19
         ...    
906   2020-11-24
907   2020-11-24
908   2020-11-22
909   2021-01-26
910   2020-12-06
Name: order_date, Length: 911, dtype: datetime64[ns]

In [118]:
pd.to_datetime(df["entry_date"])

0     2018-12-04
1     2018-12-04
2     2018-12-04
3     2018-12-04
4     2018-12-04
         ...    
906   2020-10-07
907   2020-10-07
908   2020-11-13
909   2020-11-24
910   2020-11-26
Name: entry_date, Length: 911, dtype: datetime64[ns]

In [119]:
df["order_date"] = pd.to_datetime(df["order_date"]) 
df["entry_date"] = pd.to_datetime(df["entry_date"])

In [120]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id_product        911 non-null    int64         
 1   order_date        911 non-null    datetime64[ns]
 2   product_quantity  911 non-null    float64       
 3   product_price     911 non-null    float64       
 4   entry_date        911 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 35.7 KB


Now let's apply some aggregate methods to the datetime columns of the dataset:

In [121]:
df["entry_date"].max()            # most recent entry into the system

Timestamp('2020-11-26 00:00:00')

In [122]:
df["entry_date"].min()

Timestamp('2018-12-04 00:00:00')

In [124]:
df["entry_date"].max()   - df["entry_date"].min()   # arithmetic works between datetime values.

Timedelta('723 days 00:00:00')

In [126]:
## If these had stayed strings, none of this would work. EXAMPLE:

In [127]:
a = pd.Series(["15-03-2020", "18-05-2019", "24-07-2018"])
a

0    15-03-2020
1    18-05-2019
2    24-07-2018
dtype: object

In [128]:
a.max()   
# as you can see, the max is wrong: strings are compared character by character, so the result is not the latest date. Operations like this require converting to datetime first.

'24-07-2018'

In [130]:
# a.max()   - a.min()        # this raises a TypeError, for example

In [132]:
pd.to_datetime(a, format='%d-%m-%Y')       ## specifying the format matters, so pandas does not have to guess the day/month order

0   2020-03-15
1   2019-05-18
2   2018-07-24
dtype: datetime64[ns]
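A short sketch of the same conversion on made-up data that includes one unparseable entry. `errors="coerce"` is an existing `to_datetime` option that turns such entries into `NaT` instead of raising.

```python
import pandas as pd

raw = pd.Series(["15-03-2020", "18-05-2019", "not a date"])  # made-up data

# An explicit format removes the day/month ambiguity;
# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

print(parsed.max())         # latest valid date
print(parsed.isna().sum())  # number of entries that failed to parse
```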

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Series.dt()</p>

<a id="2.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Accessor object for datetimelike properties of the Series values [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html).

For comprehensive information about the datetimelike properties, please visit the [Official Pandas API Reference Document](https://pandas.pydata.org/pandas-docs/version/0.22/api.html#datetimelike-properties).

In [133]:
df["entry_date"]

0     2018-12-04
1     2018-12-04
2     2018-12-04
3     2018-12-04
4     2018-12-04
         ...    
906   2020-10-07
907   2020-10-07
908   2020-11-13
909   2020-11-24
910   2020-11-26
Name: entry_date, Length: 911, dtype: datetime64[ns]

In [140]:
df["entry_date"].dt.year
# other options include date, year, quarter, month, day, weekday, dayofweek, hour, minute, second, microsecond; day_name() and isocalendar().week are called as methods

0      2018
1      2018
2      2018
3      2018
4      2018
       ... 
906    2020
907    2020
908    2020
909    2020
910    2020
Name: entry_date, Length: 911, dtype: int64

In [141]:
df["entry_date"].dt.quarter

0      4
1      4
2      4
3      4
4      4
      ..
906    4
907    4
908    4
909    4
910    4
Name: entry_date, Length: 911, dtype: int64

In [142]:
df["entry_date"].dt.dayofweek         # day of the week (Monday=0, Sunday=6).

0      1
1      1
2      1
3      1
4      1
      ..
906    2
907    2
908    4
909    1
910    3
Name: entry_date, Length: 911, dtype: int64

In [143]:
## Everything we have seen so far is part of pandas.
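Besides the attribute-style properties, the `.dt` accessor also exposes a few methods. A small sketch on two made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-12-04", "2020-11-26"]))

print(dates.dt.day_name().tolist())            # weekday names
print(dates.dt.isocalendar().week.tolist())    # ISO week numbers
print(dates.dt.strftime("%d.%m.%Y").tolist())  # formatted strings
```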

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Datetime Module</p>

<a id="2.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

The datetime module supplies classes for manipulating dates and times [SOURCE](https://docs.python.org/3/library/datetime.html).

### ``class datetime.datetime``

A combination of a date and a time. Attributes: year, month, day, hour, minute, second, microsecond, and tzinfo.

In [144]:
from datetime import datetime

In [145]:
datetime.now()

datetime.datetime(2022, 7, 29, 9, 26, 18, 504281)

In [146]:
print(datetime.now())

2022-07-29 09:26:45.613333


In [148]:
print(datetime.today())       ## now() and today() return the same value when no timezone is given.

2022-07-29 09:27:34.704577


In [149]:
current_datetime  = datetime.now()
print(current_datetime)

2022-07-29 09:28:22.108203


In [152]:
current_datetime.date()      ## .dt applies to a whole Series, whereas here we call a method on a single datetime value.
                         # to use datetime-module methods on a Series, we would have to go through apply.

datetime.date(2022, 7, 29)
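To make the contrast concrete, here is a sketch that extracts the date part of a made-up Series in both ways: element by element with `apply`, and with the `.dt` accessor.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2022-07-29 09:28:22", "2021-02-27 18:00:00"]))

via_apply = stamps.apply(lambda ts: ts.date())  # one Python call per element
via_dt = stamps.dt.date                         # same result through the accessor

print((via_apply == via_dt).all())
```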

### ``class datetime.timedelta``

A duration expressing the difference between two date, time, or datetime instances to microsecond resolution [SOURCE](https://www.geeksforgeeks.org/manipulate-date-and-time-with-the-datetime-module-in-python/).

In [155]:
from datetime import timedelta

In [156]:
timedelta(days=2)

datetime.timedelta(days=2)

In [157]:
current_datetime

datetime.datetime(2022, 7, 29, 9, 28, 22, 108203)

In [159]:
two_Days_before = current_datetime - timedelta(days=2)
two_Days_before

datetime.datetime(2022, 7, 27, 9, 28, 22, 108203)

In [160]:
current_datetime + timedelta(weeks=2, days=3, hours=4, minutes=10)

datetime.datetime(2022, 8, 15, 13, 38, 22, 108203)
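The same arithmetic as a reproducible sketch, with a fixed start value in place of `datetime.now()`. The names `start` and `step` are illustrative only.

```python
from datetime import datetime, timedelta

start = datetime(2022, 7, 29, 9, 28, 22)  # fixed value so the result is reproducible
step = timedelta(weeks=2, days=3, hours=4, minutes=10)

print(step.days)             # whole days only: 17
print(step.total_seconds())  # full duration in seconds: 1483800.0
print(start + step)          # 2022-08-15 13:38:22
```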

In [161]:
print(f"{'current_date': <15}", current_datetime)

print(f"{'plus': <15}", timedelta(weeks=2, days=3, hours=4, minutes=10))

print(f"{'total': <15}", current_datetime + timedelta(weeks=2, days=3, hours=4, minutes=10))

current_date    2022-07-29 09:28:22.108203
plus            17 days, 4:10:00
total           2022-08-15 13:38:22.108203


In [162]:
current_datetime

datetime.datetime(2022, 7, 29, 9, 28, 22, 108203)

In [164]:
pd.to_datetime("21.07.1980")

Timestamp('1980-07-21 00:00:00')

In [163]:
datetime.now() - pd.to_datetime("21.07.1980")

Timedelta('15348 days 10:02:17.609979')

In [165]:
datetime.now() - pd.to_datetime("31.05.1996")

Timedelta('9555 days 10:03:03.691750')

### ``strftime()``

**Converting** from a date/datetime/time object **to string type** [SOURCE](https://strftime.org/)

In [166]:
print(current_datetime)

2022-07-29 09:28:22.108203


In [167]:
current_datetime.year

2022

In [168]:
type(current_datetime.year)

int

**Note the difference.**

In [169]:
current_datetime.strftime("%Y")

'2022'

In [170]:
type(current_datetime.strftime("%Y"))

str

In [171]:
year = current_datetime.strftime("%Y")
print("year:", year)

month = current_datetime.strftime("%m")
print("month:", month)

day = current_datetime.strftime("%d")
print("day:", day)

time = current_datetime.strftime("%H:%M:%S")
print("time:", time)

date_time = current_datetime.strftime("%m/%d/%Y, %H:%M:%S")
print("date and time:", date_time)

year: 2022
month: 07
day: 29
time: 09:28:22
date and time: 07/29/2022, 09:28:22


In [173]:
## strftime also lets us extract the parts of a date as strings; this is useful, for example, when presenting results.
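A few more format codes, sketched on a fixed example value (the variable name `moment` is illustrative). The weekday and month names assume the default English locale.

```python
from datetime import datetime

moment = datetime(2022, 7, 29, 9, 28, 22)  # fixed example value

print(moment.strftime("%A, %d %B %Y"))    # weekday and month names
print(moment.strftime("%Y-%m-%dT%H:%M"))  # ISO-like layout
print(f"{moment:%d.%m.%Y}")               # format codes also work inside f-strings
```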

### strptime()

**Converting** from string type **to datetime object**

In [174]:
date_string = "21 June, 2018"
date_string

'21 June, 2018'

In [177]:
datetime.strptime(date_string, "%d %B, %Y")

datetime.datetime(2018, 6, 21, 0, 0)

In [180]:
# datetime.strptime(date_string, "%d %B %Y")   # without the comma in the format, this raises a ValueError.

In [181]:
## We could have done the same thing as follows, which is much more practical.

In [182]:
pd.to_datetime(date_string)

Timestamp('2018-06-21 00:00:00')
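The two approaches side by side, as a sketch: `strptime` needs the format to match the string exactly, while `pd.to_datetime` infers the layout for this input.

```python
from datetime import datetime
import pandas as pd

date_string = "21 June, 2018"

failed = False
try:
    datetime.strptime(date_string, "%d %B %Y")  # comma missing from the format
except ValueError as err:
    failed = True
    print("strptime failed:", err)

# pandas infers the layout here, at the cost of being less explicit
parsed = pd.to_datetime(date_string)
print(parsed == datetime(2018, 6, 21))
```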

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Operation with Datetime Object</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Let's find the time between the entry date and the first order date for each product

In [183]:
df

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date
0,401,2021-01-23,1.0,541.487603,2018-12-04
1,416,2020-04-02,1.0,131.181818,2018-12-04
2,717,2019-03-10,1.0,2035.488500,2018-12-04
3,778,2019-12-27,1.0,335.988000,2018-12-04
4,826,2020-02-19,1.0,342.292302,2018-12-04
...,...,...,...,...,...
906,1536842,2020-11-24,1.0,1186.776860,2020-10-07
907,1536842,2020-11-24,1.0,1186.776860,2020-10-07
908,1536887,2020-11-22,1.0,0.000000,2020-11-13
909,1536952,2021-01-26,1.0,988.429752,2020-11-24


In [184]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id_product        911 non-null    int64         
 1   order_date        911 non-null    datetime64[ns]
 2   product_quantity  911 non-null    float64       
 3   product_price     911 non-null    float64       
 4   entry_date        911 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 35.7 KB


**Let us do it by string methods**

In [185]:
df["order_date"] - df["entry_date"]

0     781 days
1     485 days
2      96 days
3     388 days
4     442 days
        ...   
906    48 days
907    48 days
908     9 days
909    63 days
910    10 days
Length: 911, dtype: timedelta64[ns]

In [186]:
df["time_delta"] = df["order_date"] - df["entry_date"]
df    

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta
0,401,2021-01-23,1.0,541.487603,2018-12-04,781 days
1,416,2020-04-02,1.0,131.181818,2018-12-04,485 days
2,717,2019-03-10,1.0,2035.488500,2018-12-04,96 days
3,778,2019-12-27,1.0,335.988000,2018-12-04,388 days
4,826,2020-02-19,1.0,342.292302,2018-12-04,442 days
...,...,...,...,...,...,...
906,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48 days
907,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48 days
908,1536887,2020-11-22,1.0,0.000000,2020-11-13,9 days
909,1536952,2021-01-26,1.0,988.429752,2020-11-24,63 days


In [187]:
## I will do calculations with time_delta, so I need to convert this column to a numeric value.

In [188]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype          
---  ------            --------------  -----          
 0   id_product        911 non-null    int64          
 1   order_date        911 non-null    datetime64[ns] 
 2   product_quantity  911 non-null    float64        
 3   product_price     911 non-null    float64        
 4   entry_date        911 non-null    datetime64[ns] 
 5   time_delta        911 non-null    timedelta64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), timedelta64[ns](1)
memory usage: 42.8 KB


In [190]:
# df["time_delta"].str.split()  # this raises an error: the .str accessor only works on string values.

In [192]:
df["time_delta"].astype("str").str.split()

0      [781, days]
1      [485, days]
2       [96, days]
3      [388, days]
4      [442, days]
          ...     
906     [48, days]
907     [48, days]
908      [9, days]
909     [63, days]
910     [10, days]
Name: time_delta, Length: 911, dtype: object

In [193]:
df["time_delta"].astype("str").str.split().str[0]

0      781
1      485
2       96
3      388
4      442
      ... 
906     48
907     48
908      9
909     63
910     10
Name: time_delta, Length: 911, dtype: object

In [195]:
df["time_delta"].astype("str").str.split().str[0].astype("int")

0      781
1      485
2       96
3      388
4      442
      ... 
906     48
907     48
908      9
909     63
910     10
Name: time_delta, Length: 911, dtype: int32

In [196]:
## I will do numeric operations on it, so I do not want it to remain a string.
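As an alternative to the string route, a timedelta Series exposes the day count directly through `.dt.days`. A minimal sketch using the first two rows of the dataset shown above:

```python
import pandas as pd

order = pd.Series(pd.to_datetime(["2021-01-23", "2020-04-02"]))
entry = pd.Series(pd.to_datetime(["2018-12-04", "2018-12-04"]))

delta = order - entry  # timedelta Series
days = delta.dt.days   # integer day counts, no string handling needed
print(days.tolist())
```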

In [197]:
df["time_delta"] = df["time_delta"].astype("str").str.split().str[0].astype("int")
df

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta
0,401,2021-01-23,1.0,541.487603,2018-12-04,781
1,416,2020-04-02,1.0,131.181818,2018-12-04,485
2,717,2019-03-10,1.0,2035.488500,2018-12-04,96
3,778,2019-12-27,1.0,335.988000,2018-12-04,388
4,826,2020-02-19,1.0,342.292302,2018-12-04,442
...,...,...,...,...,...,...
906,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48
907,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48
908,1536887,2020-11-22,1.0,0.000000,2020-11-13,9
909,1536952,2021-01-26,1.0,988.429752,2020-11-24,63


In [199]:
df.groupby("id_product")["time_delta"].min() 

## we take the minimum because a product may have been sold more than once.

id_product
401        781
416        485
717         96
778        388
826        442
          ... 
1536841     38
1536842     48
1536887      9
1536952     63
1536974     10
Name: time_delta, Length: 498, dtype: int32

## But this cannot be assigned directly as a column because the row counts do not match: the original has 911 rows, while this returns one row per product (498). This is where transform comes in.

In [200]:
df.groupby("id_product")["time_delta"].transform("min")

0      781
1      485
2       96
3      388
4      442
      ... 
906     48
907     48
908      9
909     63
910     10
Name: time_delta, Length: 911, dtype: int32

## Now it is right: rows for the same product receive the same value.
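The difference between aggregating and transforming can be sketched on a tiny made-up frame: the aggregation returns one row per group, while `transform` returns one row per original row.

```python
import pandas as pd

sales = pd.DataFrame({  # made-up data
    "id_product": [1, 1, 2],
    "time_delta": [30, 12, 7],
})

per_group = sales.groupby("id_product")["time_delta"].min()
per_row = sales.groupby("id_product")["time_delta"].transform("min")

print(len(per_group), len(per_row))  # 2 3
print(per_row.tolist())              # [12, 12, 7]
```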

In [201]:
df["passing_time_to_firstsale"] = df.groupby("id_product")["time_delta"].transform("min")
df

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta,passing_time_to_firstsale
0,401,2021-01-23,1.0,541.487603,2018-12-04,781,781
1,416,2020-04-02,1.0,131.181818,2018-12-04,485,485
2,717,2019-03-10,1.0,2035.488500,2018-12-04,96,96
3,778,2019-12-27,1.0,335.988000,2018-12-04,388,388
4,826,2020-02-19,1.0,342.292302,2018-12-04,442,442
...,...,...,...,...,...,...,...
906,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48,48
907,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48,48
908,1536887,2020-11-22,1.0,0.000000,2020-11-13,9,9
909,1536952,2021-01-26,1.0,988.429752,2020-11-24,63,63


![image.png](attachment:image.png)

## Let's detect the time between last order date and today for each product

How long has it been since this product was last sold?

**This time, let us do it by datetime properties**

In [203]:
df.groupby("id_product").order_date.max()

id_product
401       2021-01-23
416       2020-04-02
717       2019-03-10
778       2019-12-27
826       2020-02-19
             ...    
1536841   2020-11-22
1536842   2020-11-24
1536887   2020-11-22
1536952   2021-01-26
1536974   2020-12-06
Name: order_date, Length: 498, dtype: datetime64[ns]

In [204]:
df.groupby("id_product").order_date.transform("max").dt.date

0      2021-01-23
1      2020-04-02
2      2019-03-10
3      2019-12-27
4      2020-02-19
          ...    
906    2020-11-24
907    2020-11-24
908    2020-11-22
909    2021-01-26
910    2020-12-06
Name: order_date, Length: 911, dtype: object

In [205]:
last_order_date = df.groupby("id_product").order_date.transform("max").dt.date

In [206]:
today = pd.to_datetime("27-02-2021", format='%d-%m-%Y').date()
print(today)

2021-02-27


In [207]:
df["passing_time_from_lastsale"] = today - last_order_date

In [208]:
df

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta,passing_time_to_firstsale,passing_time_from_lastsale
0,401,2021-01-23,1.0,541.487603,2018-12-04,781,781,35 days
1,416,2020-04-02,1.0,131.181818,2018-12-04,485,485,331 days
2,717,2019-03-10,1.0,2035.488500,2018-12-04,96,96,720 days
3,778,2019-12-27,1.0,335.988000,2018-12-04,388,388,428 days
4,826,2020-02-19,1.0,342.292302,2018-12-04,442,442,374 days
...,...,...,...,...,...,...,...,...
906,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48,48,95 days
907,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48,48,95 days
908,1536887,2020-11-22,1.0,0.000000,2020-11-13,9,9,97 days
909,1536952,2021-01-26,1.0,988.429752,2020-11-24,63,63,32 days


In [209]:
df["passing_time_from_lastsale"] = df["passing_time_from_lastsale"].astype("str").str.split(" ").str[0].astype(int)
df

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta,passing_time_to_firstsale,passing_time_from_lastsale
0,401,2021-01-23,1.0,541.487603,2018-12-04,781,781,35
1,416,2020-04-02,1.0,131.181818,2018-12-04,485,485,331
2,717,2019-03-10,1.0,2035.488500,2018-12-04,96,96,720
3,778,2019-12-27,1.0,335.988000,2018-12-04,388,388,428
4,826,2020-02-19,1.0,342.292302,2018-12-04,442,442,374
...,...,...,...,...,...,...,...,...
906,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48,48,95
907,1536842,2020-11-24,1.0,1186.776860,2020-10-07,48,48,95
908,1536887,2020-11-22,1.0,0.000000,2020-11-13,9,9,97
909,1536952,2021-01-26,1.0,988.429752,2020-11-24,63,63,32
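The same result can be reached without converting to strings, by keeping the values as Timestamps and reading the day count from `.dt.days`. This is a sketch on a made-up three-row frame; the column name `days_since_last_sale` is illustrative.

```python
import pandas as pd

orders = pd.DataFrame({
    "id_product": [401, 401, 416],
    "order_date": pd.to_datetime(["2020-05-01", "2021-01-23", "2020-04-02"]),
})

today = pd.Timestamp("2021-02-27")  # same reference date as above
last_sale = orders.groupby("id_product")["order_date"].transform("max")

# Timestamp minus datetime Series gives a timedelta Series
orders["days_since_last_sale"] = (today - last_sale).dt.days
print(orders["days_since_last_sale"].tolist())
```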
