<a href="https://colab.research.google.com/github/RafsanSwadhin/NLP-Practice/blob/master/tokenization_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import spacy


In [2]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

In [4]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
data_websites = [token.text for token in doc if token.like_url ]
data_websites

['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']

In [7]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)
for token in doc:
    if token.like_num and doc[token.i+1].is_currency:
        print(token.text, doc[token.i+1].text)

two $
500 €


Great! Your code is short and efficient. Let me explain it **step by step** with your example:

---

### ✅ Your Code:

```python
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)

for token in doc:
    if token.like_num and doc[token.i+1].is_currency:
        print(token.text, doc[token.i+1].text)
```

---

### ✅ Step-by-step Explanation:

#### 1. **Input Text**:

```python
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
```

This is the sentence you're analyzing.

#### 2. **Processing with spaCy**:

```python
doc = nlp(transactions)
```

You're creating a `Doc` object. spaCy breaks the sentence into tokens and analyzes each one (POS, shape, entity, etc.).

#### 3. **Looping through tokens**:

```python
for token in doc:
```

This goes through each word or symbol in the sentence, such as `"Tony"`, `"gave"`, `"two"`, and `"$"`.

#### 4. **Condition: `token.like_num`**:

```python
if token.like_num
```

This checks if the token **represents a number**, like `"two"` or `"500"`. `like_num` returns `True` for digits and number words.
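
To make this check concrete, here is a minimal sketch of `like_num` on a few hand-picked tokens. It assumes a blank English pipeline (`spacy.blank("en")`) instead of `en_core_web_sm`, since `like_num` is a lexical attribute and should not need a trained model. The token list is built by hand purely for illustration.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline is enough here, because like_num
# is a lexical attribute and does not depend on a trained model.
nlp = spacy.blank("en")

# Build the Doc from explicit words so tokenization plays no role.
words = ["Tony", "gave", "two", "$", "and", "500", "€"]
doc = Doc(nlp.vocab, words=words)

# Map each token's text to its like_num flag.
like_num_flags = {token.text: token.like_num for token in doc}
for text, flag in like_num_flags.items():
    print(f"{text!r}: {flag}")
```

Only `"two"` and `"500"` should be flagged as number-like; names and currency symbols should not.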

#### 5. **Check Next Token: `doc[token.i+1].is_currency`**:

```python
doc[token.i+1].is_currency
```

This checks if the **next token** is a currency symbol, like `"$"` or `"€"`.
Note: `token.i` is the index of the current token.
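
One caveat with `doc[token.i+1]`: it assumes a next token exists. If a number happens to be the last token in the text, the lookup raises `IndexError`. Below is a minimal sketch of a guarded version. The helper name `amounts_before_currency` and the blank pipeline are assumptions made for illustration; they are not part of your original code.

```python
import spacy

# Assumption: only lexical attributes are needed, so a blank pipeline suffices.
nlp = spacy.blank("en")


def amounts_before_currency(doc):
    """Return (number, currency) pairs for 'number currency' sequences."""
    pairs = []
    for token in doc:
        # Guard the lookahead so a number at the very end is skipped safely.
        if token.like_num and token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            if next_token.is_currency:
                pairs.append((token.text, next_token.text))
    return pairs


# The sentence ends with a number, which would break an unguarded lookahead.
pairs = amounts_before_currency(nlp("Tony gave two $ to Peter and then 500"))
print(pairs)
```

The trailing `500` has no following token, so it is skipped instead of raising an error.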

#### 6. **Print Amount and Currency**:

```python
print(token.text, doc[token.i+1].text)
```

If both conditions are true, it prints the amount and the currency.

---

### ✅ Output:

```
two $
500 €
```

---

### ⚠️ Caution:

* If the **currency comes before** the number (e.g., `$500`), your code will **not detect it**.
* You may want to add a check for that case too if needed.



To handle **both cases**:

1. Number **before** currency (✅ your current logic): `"500 €"`
2. Currency **before** number (⚠️ missing): `"$ 500"`

We can modify your code slightly to check both **`num + currency`** and **`currency + num`** patterns:

---

### ✅ Updated Code:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
transactions = "Tony gave two $ to Peter, Bruce gave $ 500 to Steve and also gave 700 € to John"

doc = nlp(transactions)

for i in range(len(doc) - 1):
    # Case 1: number followed by currency (e.g., "500 €")
    if doc[i].like_num and doc[i+1].is_currency:
        print(doc[i].text, doc[i+1].text)

    # Case 2: currency followed by number (e.g., "$ 500")
    if doc[i].is_currency and doc[i+1].like_num:
        print(doc[i+1].text, doc[i].text)
```

---

### 💡 Example Input:

```python
"Tony gave two $ to Peter, Bruce gave $ 500 to Steve and also gave 700 € to John"
```

### ✅ Output:

```
two $
500 $
700 €
```

---

Let me know if you'd like to also **normalize number words** (like `"two"` → `2`) using `text2num`.
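
As a rough illustration of that normalization step, the sketch below uses a small hand-written lookup table instead of a library. The table `NUMBER_WORDS` and the helper `normalize_amount` are hypothetical names, and the mapping covers only a handful of words; a dedicated package would handle many more cases.

```python
# Hypothetical lookup table: only a few number words, for illustration.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def normalize_amount(text):
    """Return a number for digit strings or known number words, else None."""
    cleaned = text.replace(",", "").lower()
    if cleaned in NUMBER_WORDS:
        return NUMBER_WORDS[cleaned]
    try:
        # Keep integers as int; fall back to float for decimals.
        return float(cleaned) if "." in cleaned else int(cleaned)
    except ValueError:
        return None


for amount in ["two", "500", "1,250", "Tony"]:
    print(amount, "->", normalize_amount(amount))
```

Each amount printed by the loops above could be passed through a helper like this before further processing.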


In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")
transactions = "Tony gave two $ to Peter, Bruce gave $ 500 to Steve and also gave 700 € to John"

doc = nlp(transactions)

for i in range(len(doc) - 1):
    # Case 1: number followed by currency (e.g., "500 €")
    if doc[i].like_num and doc[i+1].is_currency:
        print(doc[i].text, doc[i+1].text)

    # Case 2: currency followed by number (e.g., "$ 500")
    if doc[i].is_currency and doc[i+1].like_num:
        print(doc[i+1].text, doc[i].text)


two $
500 $
700 €
